How Deep Learning Networks Can Use Virtual Worlds To Solve Real World Problems

Panorama created with the Unreal 4 game engine. Credit: Wikimedia

Deep learning is a type of artificial intelligence that has been successfully applied in areas ranging from voice, image and text recognition to game playing, cybersecurity and emotion identification. Success in all of these fields is rooted in the ability of deep learning networks to extract useful information from unstructured real-world data such as collections of pictures or webcam videos of human faces. Although deep learning networks provide solutions to real-world problems, assets created in virtual worlds can help solve some of the difficulties that are associated with the process through which the networks learn.

For an introduction to deep learning, see “What Is Deep Learning And How Is It Useful”.

Training deep neural networks

Deep learning networks excel at identifying and learning patterns in unstructured data sets. For example, a network may learn to identify brand logos in pictures posted on social media, health-threatening abnormalities in x-rays and MRIs, or human emotions from facial expressions captured by webcams.

The learning process involves an extended period of training in which the network is given examples of the thing it is designed to learn (pictures from social media that may contain brand logos, for example), extracts information from the examples (learns which logos, if any, are in the pictures), tests itself to determine whether the information it extracted improved its ability to recognize the examples (tests whether logo recognition improved), and then adjusts itself so that the next time it tries it will do a better job (ensures that logo recognition will be better the next time around). This learning process repeats until the network has achieved a predetermined level of accuracy in identifying whatever it is that it was designed to learn.
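
The show, test, adjust and repeat cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a single-layer perceptron rather than a deep network, the toy data set (points labeled by whether x + y exceeds 1) is invented, and names such as `train` and `target_accuracy` are hypothetical.

```python
import random

def predict(weights, bias, features):
    """Return 1 if the weighted sum is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if total > 0 else 0

def accuracy(weights, bias, examples):
    """Fraction of examples the current model labels correctly."""
    correct = sum(predict(weights, bias, x) == y for x, y in examples)
    return correct / len(examples)

def train(examples, target_accuracy=0.95, learning_rate=0.1, max_epochs=1000):
    """Repeat the show-test-adjust cycle until the accuracy target is met."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        for features, label in examples:
            # Test the current guess, then adjust in proportion to the error.
            error = label - predict(weights, bias, features)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
        # Stop once the predetermined level of accuracy is reached.
        if accuracy(weights, bias, examples) >= target_accuracy:
            return weights, bias, epoch
    return weights, bias, max_epochs

# Invented toy data: label is 1 when x + y > 1, with a gap between classes.
rng = random.Random(0)
examples = []
while len(examples) < 200:
    x, y = rng.random(), rng.random()
    if abs(x + y - 1) > 0.1:
        examples.append(((x, y), 1 if x + y > 1 else 0))

weights, bias, epochs = train(examples)
print(f"stopped after {epochs} epochs, "
      f"accuracy {accuracy(weights, bias, examples):.2f}")
```

A real deep learning network has many layers and adjusts itself with gradient-based methods, but the outer loop has the same shape: keep cycling through the examples until a preset accuracy target is met.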

For real-world applications, training usually demands access to very large sets of training data. Consider the problem faced by a deep learning network that is designed to identify brand logos in pictures. The logo may be located anywhere in the picture. It may be large or small, right-side up or upside down. It may be viewed straight on or at any angle. It may be in or out of focus. If the network is going to be useful for companies tracking how their products appear on social media, it has to learn to recognize the logo under all of these circumstances, which means the training data must include pictures that show all of these different cases. Training the network may require access to hundreds of thousands, or possibly millions, of pictures.
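
A rough back-of-the-envelope calculation shows how quickly these cases multiply. The counts below are hypothetical placeholders, not figures from any real project; the point is only that independent sources of variation multiply rather than add.

```python
from math import prod

# Hypothetical number of distinct cases per factor; none come from real data.
variations = {
    "position in the picture": 25,
    "size": 5,
    "rotation": 8,
    "viewing angle": 5,
    "focus": 3,
}

def cases_per_logo(factors):
    """Number of distinct combinations when every factor varies independently."""
    return prod(factors.values())

def pictures_needed(factors, logos, examples_per_case=1):
    """Rough lower bound on training pictures needed to cover every case."""
    return cases_per_logo(factors) * logos * examples_per_case

print(cases_per_logo(variations))             # 15000
print(pictures_needed(variations, logos=20))  # 300000
```

Even with these modest assumed counts, covering 20 logos once per case already calls for 300,000 pictures.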

Training deep learning networks demands more than access to large sets of training data. It demands access to large sets of the right kind of training data. Hundreds of thousands of pictures can be scraped from a social media platform like Twitter or Facebook, but if very few of those pictures contain the information of interest (such as brand logos), they are not going to be very useful for training a deep learning network.

Having access to massive data sets may not be a problem for a company like Google, which uses deep learning networks to improve its search algorithms and currently processes more than 3.5 billion searches every day. For smaller companies or academic researchers, however, gaining access to large sets of training data may be a significant problem.

Bringing virtual worlds into the picture

Modern game development engines have achieved a level of 3D realism that suggests an alternative approach to fulfilling the need for massive sets of training data. You can make your own virtual data instead of trying to acquire enough real data to train your network. Adrien Gaidon, a Research Scientist in the computer vision group at Xerox Research Centre Europe, is doing just that. He’s using the Unity game development engine to create virtual scenes that can be used to train deep learning networks.

One of Gaidon’s projects is developing a deep learning system that identifies empty parking places from video taken by cameras on buses as they drive through the streets. The network is trained on a combination of real-world video and virtual-world scenes created with Unity. Gaidon and his collaborators use a 3D laser scanner to capture a real-world scene, which is then recreated in the game development engine. The picture above shows the real-world images on the left and the virtual recreations on the right.

Having virtual-world representations of real-world scenes makes it relatively easy to generate training data for deep learning networks because once the virtual scenes are created, they can be duplicated and manipulated in any number of ways. For example, if you want a network to be able to identify empty parking places on both sides of the street using real-world training data, you have to obtain video footage that shows empty spots on both sides of the street from vehicles moving in both directions. With a virtual 3D recreation of a street scene, you can create empty spaces by removing parked cars from the image, and can recreate the effect of camera-equipped vehicles moving in different directions by simply changing the camera location in the virtual scene.
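
The idea of deriving many labeled variants from one scanned scene can be sketched with a toy model. Everything here is an assumption made for illustration: the `StreetScene` class, its fields and the helper functions are hypothetical and are not part of Unity or of Gaidon's pipeline, which works with full 3D scenes rather than lists of parking spots.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class StreetScene:
    east_curb: tuple        # one bool per parking spot; True means a car is parked
    west_curb: tuple
    heading: str = "north"  # direction the virtual camera vehicle travels

def remove_car(scene, curb, index):
    """Return a copy of the scene with one parked car deleted."""
    spots = list(getattr(scene, curb))
    spots[index] = False
    return replace(scene, **{curb: tuple(spots)})

def reverse_camera(scene):
    """Return a copy of the scene filmed from a vehicle driving the other way."""
    heading = "south" if scene.heading == "north" else "north"
    return replace(scene, heading=heading)

def empty_spots(scene):
    """List empty spots as (side seen from the camera, spot index)."""
    east_side = "right" if scene.heading == "north" else "left"
    west_side = "left" if scene.heading == "north" else "right"
    labels = []
    for side, spots in ((east_side, scene.east_curb), (west_side, scene.west_curb)):
        labels += [(side, i) for i, parked in enumerate(spots) if not parked]
    return sorted(labels)

# One scanned scene yields a second labeled variant without any new footage.
scanned = StreetScene(east_curb=(True, True, True), west_curb=(True, False, True))
variant = reverse_camera(remove_car(scanned, "east_curb", 0))

print(empty_spots(scanned))  # [('left', 1)]
print(empty_spots(variant))  # [('left', 0), ('right', 1)]
```

Because the scene is edited rather than filmed, the correct labels for each variant are known automatically, which is a large part of what makes virtual training data cheap to produce.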

The flexibility of virtual worlds isn't limited to removing objects and changing camera positions. Objects can also be modified and new objects can be added. Visual angle can be easily manipulated along with camera position. Lighting can be changed at will.

This high degree of flexibility has the potential to solve a common problem that arises when training deep learning networks for robust real-world applications. Situations that the network is expected to handle but that were poorly represented in the training data will inevitably occur. For example, a network designed to spot empty parking spaces using real-world video will have to be able to deal with environmental conditions like rain, fog and snow, or the changes in light and shadow that occur throughout the day. These conditions may not have occurred in the available real-world training data. Technical artifacts like lens flare or motion blur may also be missing from the training data yet present in the day-to-day video the network has to work with. It’s generally a simple matter to add these environmental conditions and technical artifacts to virtual-world scenes.
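
One way to think about this gap is as a coverage check: list the condition combinations the network should handle, subtract the ones the real footage already contains, and render the rest virtually. The sketch below assumes invented condition lists and an invented set of real clips; the function name `missing_conditions` is hypothetical.

```python
from itertools import product

# Assumed condition lists, chosen only for illustration.
WEATHER = ("clear", "rain", "fog", "snow")
TIME_OF_DAY = ("morning", "noon", "dusk")
ARTIFACTS = ("none", "lens flare", "motion blur")

def missing_conditions(real_clips):
    """Return every condition combination absent from the real footage."""
    seen = set(real_clips)
    return [combo for combo in product(WEATHER, TIME_OF_DAY, ARTIFACTS)
            if combo not in seen]

# Invented inventory of the conditions captured in real-world video.
real_clips = [
    ("clear", "noon", "none"),
    ("clear", "morning", "none"),
    ("rain", "noon", "motion blur"),
]

to_render = missing_conditions(real_clips)
print(len(to_render))  # 33 of the 36 combinations still need virtual scenes
```

In this made-up inventory, the real footage covers only 3 of 36 combinations, and conditions such as fog and snow never appear at all.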

Gaidon’s work thus far has focused on training and testing deep learning networks with prerendered virtual assets. However, there seems to be no reason why training and testing imagery could not be built from scratch and precisely tuned to the training needs of a specific network application. The skills needed may be more easily found in the art department of a game developer’s studio than among the software engineers of a company building deep learning applications, but that’s a problem that’s easily solved.

The potential benefits of using virtual data to train and test deep learning networks that are designed to work in the real world are only of value if the networks perform well with real-world input after they have been trained with virtual data. In an email exchange, Gaidon told me that networks trained only on virtual data are not quite as accurate as networks trained solely on real data, although they are close. Mixing virtual data with a relatively small amount of real data late in the training process has been shown to improve performance. How to mix real and virtual data together for training is an active area of research, and optimal training mixtures have yet to be identified. Gaidon will be presenting a paper at the annual Computer Vision and Pattern Recognition Conference in Las Vegas in late June that addresses these issues in detail.
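
As a purely illustrative sketch of what "mixing in real data late" could look like, the function below serves virtual samples alone for most of training and then blends in a small real set for the final epochs. The switch point (`real_start`) and blend ratio (`real_share`) are arbitrary assumptions, not values from Gaidon's work, and since optimal mixtures have yet to be identified, this should be read as one possible schedule rather than a recommended one.

```python
import random

def epoch_samples(virtual, real, epoch, total_epochs,
                  real_start=0.8, real_share=0.25, rng=None):
    """Return the training samples for one epoch.

    Early epochs use virtual samples only. From `real_start` (a fraction of
    total training) onward, `real_share` of each epoch is drawn from the
    small real-world set. Both values are illustrative assumptions.
    """
    rng = rng or random.Random(0)
    size = len(virtual)
    if epoch < real_start * total_epochs:
        return rng.sample(virtual, size)  # shuffled virtual-only epoch
    n_real = int(size * real_share)
    # The real set is small, so draw from it with replacement.
    mixed = rng.sample(virtual, size - n_real) + rng.choices(real, k=n_real)
    rng.shuffle(mixed)
    return mixed

# Invented stand-ins for 100 virtual and 10 real training samples.
virtual = [("virtual", i) for i in range(100)]
real = [("real", i) for i in range(10)]

early = epoch_samples(virtual, real, epoch=0, total_epochs=10)
late = epoch_samples(virtual, real, epoch=9, total_epochs=10)

print(sum(kind == "real" for kind, _ in early))  # 0
print(sum(kind == "real" for kind, _ in late))   # 25
```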

"Oh, it's game on now"

Using game development engines to create realistic virtual data opens up intriguing possibilities for training deep learning networks. The ability to design virtual training data that are carefully tuned to the requirements of specific deep learning applications may make the training process more efficient and less time-consuming. The relative ease with which large quantities of virtual data can be created may open up powerful deep learning solutions to users who do not have ready access to massive sets of real-world training data.

Deep learning is proving to be a powerful tool for extracting useful information from unstructured data in order to provide solutions across a broad range of fields. The addition of near photorealistic imagery created with modern game development engines holds out the possibility of deep learning becoming even more useful and more widely available in the near future.
