The Quest to Give Robots the Power of Human Sight

Digital eye

Light is all we can see. More specifically, any visible light that reaches our eyes is transformed into electrical signals.

Our brains then receive and interpret those signals to form mental images according to specific properties of the light that generated them. Thus, we could argue that colors, for example, do not exist outside our perception, since they are just our brain’s interpretation of streams of photons with different wave properties. The bright world we see, with auroras and sunsets and the beautiful colored patterns in butterflies are nothing but energy, when there are no eyes to see.

Contrary to what a simplified physical description might imply, vision is one of the most complex biological mechanisms. Among other complex functions, our nervous system perceives depth, processes shapes and motion, separates objects from the background, and distinguishes visual changes as the body moves. And it takes a lot of brain power to do so. An estimated 30-50% of the human brain is devoted to vision. The brain’s visual system is so elegantly complex and efficient that scientists have been modelling it with modern computers to advance new technologies.

Visual processing is one of the most studied systems of the human brain. Our vision is constructed through a cascade of cells organized in a hierarchical system that delivers data to create mental images of whatever we look at. Once scientists understood this layered structure, they were able to apply this knowledge to major advances in the ability of computers to identify visual patterns. They began a quest to write algorithms based on the vision architecture, hoping to replicate the human ability of visual pattern recognition in machines.

Complex human vision challenges AI

The invention of the neural network ‘Neocognitron’ by Dr. Kunihiko Fukushima in the 1980s, which mimicked biological architecture, was a landmark in our journey to understand vision. The Neocognitron was the first of its kind able to identify meaningful symbols. For instance, after training, this network could recognize typed and even handwritten characters. In the 1990s, the Neocognitron was followed by the more advanced, also biologically inspired, Hierarchical Model and X, or HMAX. Developed by Dr. Tomaso Poggio, the HMAX was more flexible in the task of recognizing objects.

To this day, the same logic still underlies cutting-edge algorithms that train computer vision systems to identify objects in motion, such as the technology for autonomous driving vehicles. Computers with visual intelligence can now identify objects and people and reconstruct three-dimensional scenes from two-dimensional images. This technology is currently employed in many areas, such as in medical identification of tumors or agricultural weeding. Researchers are also developing algorithms to explore the bottom of the ocean and defuse dangerous minefields. Although there are many unresolved ethical issues, the technology is being used for facial recognition and identification of criminal suspects, and in the future could be used to anticipate patterns of criminal behavior.

The breakthroughs in this area had a boost with the development of deep learning. This is a type of machine learning that can be understood as a hierarchical feature learning based on the brain’s neural network. This means that observed data is generated via the interactions of many factors distributed in multiple levels. Although deep learning architecture is inspired by the human brain, the currently available technology limits the efficiency of these robots, and their visual recognition takes a lot more energy to work than nature’s version. If human vision depended on their methods, seeing probably wouldn’t feel so effortless. Thus, there is still a long way for computer scientists to effectively mimic human vision in machines.

“Machine learning is a subfield of artificial intelligence, which is broadly defined as the capability of a machine to imitate intelligent human behavior. Artificial intelligence systems are used to perform complex tasks in a way that is similar to how humans solve problems”. MIT Management School.

Brain-like computer.

Among the several challenges related to trying to replicate such a complex system, differentiating objects and people is particularly difficult for computers. Unlike most animals, machines have a hard time distinguishing between same and different. This is because there is the abstract component, a power of generalizability, lying within this differentiation that machines (actually, their developers) have a hard time putting together. Not coincidentally, this ability is a crucial hallmark of intelligence, foundational for all kinds of inference. This shows that we, humans, probably shouldn’t take this ability for granted, and perhaps acknowledge the value in the flexibility of real intelligence over artificial intelligence.

Distinguishing the same and the different

To Dr. Matthew Ricci, post-doctoral fellow at the Hebrew University of Jerusalem, truly intelligent visual reasoning machines need the ability to recognize sameness and difference. The study of same-difference relations is so crucial that researchers have already started incorporating not only visual scenes into the training, but also natural language and physical interactions, said Dr. Adam Santoro, a researcher at DeepMind. But it might still take a while before we have fully developed machine vision. Today, the Convolutional Neural Networks (CNNs) are the most powerful classes of artificial intelligence systems, yet only under very restricted conditions it can tell if two patterns are identical or not.

This difficulty was evident in a famous Google incident, in 2015, when its image-recognition algorithm auto-tagged pictures of Black people as gorillas. The company dealt with the problem by preventing the algorithms from ever labeling any image as a gorilla, chimpanzee, or monkey, even if the pictures were of the primates themselves, according to Wired magazine. Google responded that, although the solution was not perfect, this was the best available option, since machine learning technology is still maturing. Beyond exposing Google’s ethical and civil responsibilities, this incident highlighted current artificial intelligence limitations and, at the same time, the influence that this incipient technology already has on society.

Scientists are now studying the human brain in more detail, so they can enhance visual processing technology. Artificial neural networks based on brain architecture can already predict neural responses to images. Researchers do not entirely agree on what might happen in the future. For some, the CNNs lack a fundamental reasoning capability that can’t be overcome by inputting more data and training. Others believe they can be more useful than currently thought, with proper research. For Dr. Ricci, we may need a breakthrough before machines can properly learn same-different distinctions. Until then, the bright world we see, the sunsets and auroras, butterflies and waterfalls, they all shall remain exclusive to us humans.