From early 2D image analysis to the incorporation of deep learning models, computer vision has a rich history of developments that continue to expand its uses.
We’ve heard a lot recently about how ChatGPT has demonstrated the latest capabilities of AI in conversation. But how can AI help us see and analyze the world around us?
​
We used to think of analyzing images and videos as a task only humans could do. We see this in the way a CAPTCHA tests whether a user is human. But what could we do with a computer that can analyze images and videos? Computer vision, a subfield of artificial intelligence, aims to build technology that can analyze images and videos as accurately as a human can. Since its inception in the 1960s, the field has come a long way, progressing from 2D and 3D techniques to the latest breakthroughs in deep learning.
​
Early Days of Computer Vision: Developing 2D Capabilities
Computer vision has its roots in the early days of computer science, when researchers focused on developing algorithms that could extract information from 2D images. The goal was to create computer programs that could perform tasks like image segmentation, object detection, and image recognition. These techniques were the foundation of computer vision and laid the groundwork for the development of more advanced methods in the future.
​
In the early days, much of the focus was on image processing techniques such as edge detection and other ways of extracting “features” from an image. Edge detection identifies the boundaries of objects by finding the points where the brightness or color of the image changes abruptly. Beyond edges, researchers also looked for “corners” and other distinctive features: unique patterns in an image that could be tracked from one image to the next. Feature extraction was essential for developing algorithms that could recognize objects in images. A key milestone in the field's history was David Lowe's Scale-Invariant Feature Transform (SIFT) algorithm, published in 1999, which revolutionized feature extraction in computer vision and is still widely used in many applications.
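To make this concrete, here is a minimal sketch of classic feature extraction using the OpenCV library. It is illustrative only: it assumes opencv-python is installed, "photo.jpg" is a placeholder image path, and the Canny thresholds are arbitrary example values.

```python
# A minimal sketch of classic 2D feature extraction with OpenCV
# (assumes opencv-python is installed; "photo.jpg" is a placeholder path).
import cv2

img = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)

# Edge detection: mark the pixels where brightness changes abruptly.
edges = cv2.Canny(img, 100, 200)

# SIFT: find distinctive keypoints and descriptors that can be
# matched between images regardless of scale.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)

print(f"Found {len(keypoints)} SIFT keypoints")
cv2.imwrite("edges.jpg", edges)
```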
​
Other key milestones in 2D computer vision history include the first optical character recognition (OCR) systems in the 1950s and the first facial recognition systems in the 1960s. These early systems laid the groundwork for more advanced computer vision applications in the future. However, in all of this work the features (be they edges or corners) were hand-designed. A key aspect of the deep learning revolution, which we will discuss later, is the transition from hand-crafted to machine-learnt features. In fact, one can think of researchers in the pre-deep-learning era crafting ever more sophisticated hand-designed features a little like medieval craftsmen who would spend a lifetime perfecting an artifact, all to be swept away by the industrial revolution to come.
Advancements in Computer Vision: From 2D to 3D
As computer vision technology advanced, researchers began exploring ways to extend its capabilities beyond 2D images and into the realm of 3D. 3D computer vision involves using algorithms and techniques to extract 3D information from 2D images or other sensory data.
One of the key advances in 3D computer vision was structure-from-motion (SfM). SfM analyzes a series of 2D images to reconstruct the 3D geometry of a scene together with the camera poses that produced the images. This technique has found widespread use in applications such as autonomous driving, robotics, and virtual and augmented reality.
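For readers who want a feel for the geometry involved, the sketch below uses OpenCV to estimate the relative camera pose between two overlapping photos, the core step that SfM repeats across many images. It is a rough illustration only: the image names and the intrinsic matrix K are assumptions, not values from any specific system.

```python
# A minimal two-view sketch of the geometry behind structure-from-motion,
# assuming OpenCV is installed and "view1.jpg" / "view2.jpg" are two
# overlapping photos of the same scene (placeholder file names).
import cv2
import numpy as np

img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

# 1. Detect and match distinctive features between the two views.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)

pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# 2. Robustly estimate the essential matrix (RANSAC discards bad matches),
#    then recover the camera rotation R and translation t between views.
#    The intrinsic matrix K below is an illustrative assumption.
K = np.array([[1000.0, 0.0, img1.shape[1] / 2],
              [0.0, 1000.0, img1.shape[0] / 2],
              [0.0, 0.0, 1.0]])
E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

print("Relative rotation:\n", R)
print("Relative translation (up to scale):\n", t.ravel())
```

Full SfM pipelines chain this kind of two-view estimate across many images, then triangulate and refine the 3D points.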
​
In 1998, computer vision researcher Phil Torr, now Professor at the University of Oxford and leader of the Torr Vision Group, and his colleagues won the Marr Prize for their work on robustly recovering the 3D shape of an object or scene from 2D images. This body of work had a significant impact on the field of computer vision and is widely used in a range of applications. The work underpinned the algorithm design for Boujou, software developed by the start-up 2D3. Boujou played a crucial role in camera tracking and 3D reconstruction for film and video post-production, including blockbusters like Harry Potter and the Sorcerer's Stone and The Lord of the Rings: The Fellowship of the Ring, as well as TV shows and commercials. It uses advanced computer vision algorithms to track the motion of the camera and reconstruct the 3D geometry of the scene, allowing filmmakers to integrate computer-generated imagery with live-action footage. It went on to win a wide variety of industry awards, including the Computer Graphics World Innovation Award, the IABM Peter Wayne Award, the CATS Award for Innovation, and a technical Emmy.
​
Incorporating Machine Learning into Computer Vision
In the 1990s computer vision focused heavily on 3D reconstruction; however, as the 2000s unfolded there was a reawakened interest in machine learning. This integration with machine learning resulted in the development of algorithms for object recognition, face detection, and object tracking. These algorithms learn patterns from data and use them to make predictions, enabling them to accurately identify and track objects in images and video.
​
Significant milestones in the history of machine learning in computer vision include:
The creation of the ImageNet benchmark database for object recognition.
The use of convolutional neural networks for object recognition.
The development of GPUs driven by the gaming industry (essential for deep learning).
The availability of huge amounts of data via the internet and mobile devices, together with cheap storage to hold it.
ImageNet in particular stimulated research into object recognition, and it was in the field of computer vision that deep networks first showed their worth by topping the ImageNet challenge in 2012.
The Emergence of Deep Learning in Computer Vision
In recent years, deep learning has emerged as the dominant paradigm in computer vision, based on artificial neural networks trained on large datasets of labeled images. These networks can recognize complex patterns in images and achieve state-of-the-art performance on a wide range of computer vision tasks, replacing the hand-crafted features of the previous era with machine-learnt ones.
Key deep learning models include AlexNet, Generative Adversarial Networks (GANs), and the Vision Transformer (ViT). AlexNet was the first deep learning architecture to dominate large-scale object recognition, achieving significant improvements in accuracy over previous state-of-the-art methods. GANs can generate realistic images and videos from scratch, while ViT has achieved state-of-the-art performance on many computer vision tasks, including image recognition and object detection.
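As an illustration of how accessible these learnt models have become, the sketch below classifies an image with a pretrained AlexNet via PyTorch's torchvision library. It is a minimal example under stated assumptions: PyTorch and torchvision are installed, and "photo.jpg" is a placeholder path (ViT and other models can be swapped in the same way).

```python
# A minimal sketch of image recognition with a pretrained deep network,
# assuming PyTorch and torchvision are installed; "photo.jpg" is a
# placeholder path. The features here are learnt from ImageNet data
# rather than hand-designed.
import torch
from torchvision import models
from PIL import Image

# Load a pretrained AlexNet classifier (ViT or others work similarly).
weights = models.AlexNet_Weights.DEFAULT
model = models.alexnet(weights=weights).eval()

# Preprocess the image the same way the network was trained.
preprocess = weights.transforms()
batch = preprocess(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    probs = model(batch).softmax(dim=1)

top_prob, top_idx = probs[0].max(dim=0)
print(weights.meta["categories"][top_idx.item()], f"({top_prob.item():.1%})")
```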
These new deep learning models, along with Phil Torr's extensive experience in computer vision, went into one of his spinouts, Five.AI, now owned by the Bosch Group. Phil Torr helped direct and shape it from a company of just six founders into an organization with over 140 employees, and helped raise over £12 million from venture capitalists plus a matching government contribution as part of a 2017 CCAV project dubbed "StreetWise".
Current and Future Developments of Computer Vision
After his work at Five.AI, Phil Torr went on to found Aistetic with Duncan McKay, with the simple purpose of making next-generation 3D body modelling accessible to anyone with a mobile device. Aistetic is a University of Oxford spinout that offers an innovative, state-of-the-art AI solution for 3D body measurement and sizing recommendations. Aistetic's software plugs into brands' and retailers' existing websites and apps, allowing people to get their measurements and sizing for that specific retailer before making a purchase. This technology solves the common problem of buying clothes online that don't fit properly. Aistetic has won two Innovate UK grants, been part of the Digital Catapult's Machine Intelligence Garage and the University of Edinburgh AI Accelerator, and holds a grant from the Future Fashion Factory and the University of Leeds.
With Aistetic's software, people can shop confidently using any mobile phone. Aistetic's technology is a prime example of the exciting advancements in computer vision and machine learning that are changing the way everyday people interact with technology.
Summary
Computer vision has come a long way since its early days in the 1960s. From the first edge detection algorithms to the latest deep learning models, it has made significant progress in object recognition, face detection, object tracking, and more. Phil Torr continues to contribute to this progress with his latest project, Aistetic, which aims to revolutionize the online shopping experience.
​
To use Aistetic’s state-of-the-art computer vision in your ecommerce brand & clothing business, please book a meeting with one of our product specialists today.