Ambitions in A.I. - Google Photos, why now?
Sat, Aug 8, 2015
At the last Google I/O, in 2015, Google announced Google Photos, a free photo storage service. The main limitations of the service concern quality: photos are capped at 16 megapixels and videos at 1080p, but that is well within acceptable bounds for storing photos and videos for home use.
So why does Google, seemingly out of nowhere and with timing that does not match the current hype cycles, start a free photo storage service? Some speculate that Google simply wants more information about its users. There is obviously some truth to that, but others have noted that there are further applications as well.
Google is working on artificial intelligence, as we all know, and has been for some time, for a number of different applications. The most obvious one is image recognition. This started with CAPTCHAs: simple, relatively small samples. The CAPTCHA challenges were snippets from books or articles that were hard to read with Google’s state-of-the-art OCR capabilities of that time. Google also ventured into speech recognition, using recordings from Google Talk and (probably) Google Hangouts as training data. This resulted in solid support for voice commands in various systems, such as Chrome. They also extended into more advanced feats of computer vision, such as recognizing traffic signs and other relevant aspects of traffic situations, which is required for autonomous driving, as in Google’s driverless car experiments.
Others have noted this, and it is not hard to see what Google could do with users’ photos. However, I do not believe the effort is limited to getting a bit more insight into the user. That seems too short-sighted for a company like Google, which has shown us that it thinks on quite an ambitious scale. (Although I do not contest that the gained user information has its uses.)
It is easy to see how Google could use speech recognition and computer vision to discover more information about the user. But discovering information in general may turn out to be far more important. Speech, that is conversations, can be used to teach a computer to understand a conversation in context; it is also a way to gather data from the conversation itself. However, speech by itself is not enough to understand many aspects of the physical world. By venturing into the world of complex visual data, visual representations of the world as we experience it, we can let a computer learn far more. And this comes on top of the capability to understand audio and other data that is already available.
Google Photos may prove to be a quite significant missing link in training an artificial intelligence. Photos provide (almost always) clear and unmodified visual data. The resolution that is now common provides a detailed enough view to distinguish any object that is also visible to the naked eye. Furthermore, the metadata attached to photos provides location data, an approximate date and time, the parameters with which the photo was taken, and other useful information.
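To make that metadata concrete, here is a minimal sketch of reading it from a photo file, assuming the Python Pillow library and a JPEG that actually carries EXIF data; the filename is an invented example. This is in no way Google’s pipeline, just an illustration of what is sitting in a typical photo.

```python
# Rough sketch: read the EXIF metadata a photo typically carries,
# using Pillow. Which tags are present depends on the camera.
from PIL import Image, ExifTags


def to_float(value):
    """Convert an EXIF rational, either a (num, den) tuple or a rational object."""
    if isinstance(value, tuple):
        return value[0] / float(value[1])
    return float(value)


def dms_to_degrees(dms, ref):
    """Convert EXIF degrees/minutes/seconds plus an N/S/E/W reference to decimal degrees."""
    degrees = to_float(dms[0]) + to_float(dms[1]) / 60 + to_float(dms[2]) / 3600
    return -degrees if ref in ("S", "W") else degrees


def read_photo_metadata(path):
    exif_raw = Image.open(path)._getexif() or {}
    # Map numeric EXIF tag ids to readable names.
    exif = {ExifTags.TAGS.get(tag, tag): value for tag, value in exif_raw.items()}

    info = {
        "taken_at": exif.get("DateTimeOriginal"),   # approximate date and time
        "exposure_time": exif.get("ExposureTime"),  # how the photo was taken
        "focal_length": exif.get("FocalLength"),
        "iso": exif.get("ISOSpeedRatings"),
    }

    gps_raw = exif.get("GPSInfo")
    if gps_raw:
        gps = {ExifTags.GPSTAGS.get(tag, tag): value for tag, value in gps_raw.items()}
        info["latitude"] = dms_to_degrees(gps["GPSLatitude"], gps["GPSLatitudeRef"])
        info["longitude"] = dms_to_degrees(gps["GPSLongitude"], gps["GPSLongitudeRef"])
    return info


print(read_photo_metadata("eiffel_tower.jpg"))
```

The location comes out as decimal latitude and longitude, which is exactly the kind of anchor in space and time that the rest of this post builds on.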
Google Photos already offers a number of “easy A.I. tricks” for improving photos: intelligent choices for sharpness, brightness, color balance and other properties, so users do not have to figure them out manually. But again, that is just a gimmick compared to feats such as synthesizing a video from a number of photos.
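As a toy illustration of such a trick (and nothing more; how Google Photos actually decides its enhancements is not public), the snippet below nudges brightness, color and sharpness with Pillow using fixed, hand-picked factors. A real service would derive the factors per photo, for example from its histogram, rather than hard-coding them.

```python
# Toy "auto enhance": apply fixed brightness, color and sharpness factors.
from PIL import Image, ImageEnhance


def quick_enhance(path, out_path, brightness=1.1, color=1.15, sharpness=1.3):
    img = Image.open(path)
    img = ImageEnhance.Brightness(img).enhance(brightness)  # lift exposure slightly
    img = ImageEnhance.Color(img).enhance(color)            # make colors a bit richer
    img = ImageEnhance.Sharpness(img).enhance(sharpness)    # crisp up edges
    img.save(out_path)


quick_enhance("eiffel_tower.jpg", "eiffel_tower_enhanced.jpg")
```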
I expect that quite soon Google will be able to demonstrate some impressive multi-disciplinary intelligent operations. For example: imagine how many times the Eiffel Tower has been captured in a photo. Let’s make a conservative guess and say at least one million times. Now suppose all of these pictures end up on Google Photos. What could be done with that?
- Based on computer vision alone, without any additional information: recognize the structure that is the subject of the photo, the Eiffel Tower.
- Computer vision combined with some metadata: determine where the photographer stood when the photo was taken, and thus which side of the Eiffel Tower is captured (see the bearing sketch after this list).
- Computer vision with some more metadata: determine the time and place of the photograph and, for that time and place, the visible conditions of the Eiffel Tower, such as the amount of light falling on it. At night, approximately recognize the (artificial) lighting.
- Initially, these images combined with weather reports should be enough to relate the weather on a given day to the amount of light captured in the photo and to which parts of the Eiffel Tower were darkened for lack of light. In time, it should be possible to derive approximate weather conditions just by looking at the images. Given date and time metadata, the system should be able to distinguish seasonal light variations from weather-related ones, for example telling a winter day apart from a cloudy day.
- With some amount of training, it should be feasible to estimate the approximate time of day and time of year at which a photo was taken, even when this information is not available in the metadata, just by observing the amount of daylight and sunlight (a crude version of this idea is sketched after this list). Note that here it becomes valuable, even necessary, that an A.I. can reason about photos and correct deviations, such as color casts and white balance, in order to work with a consistent and complementary set of visual data. In time, it should even be able to create a perfect 3D model of the Eiffel Tower and simulate any lighting condition, day or night, as well as shadows.
- As a derivative of known information, it should even be able to detect unexpected, “unexplained” events that affect the appearance of the Eiffel Tower as captured in a photo, such as the casting of a shadow. Google’s A.I. could notify Eiffel Tower maintenance when lights have gone out. Given the vast number of photos, it is feasible to observe subtle changes over time.
Hypothetically, if a building were erected next to it and cast a (new) shadow on the Eiffel Tower, the A.I. should be “smart” enough to know what to investigate, where to look, in order to find out what causes it. It should be able to correlate different images to “find” the “object” that is casting the shadow.
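To give a feel for the second point in the list, the “which side is captured” question, here is a small geometry sketch under the assumption that the photographer’s GPS position has already been read from the EXIF data (as in the earlier snippet). The Eiffel Tower’s coordinates are public; the camera position used in the example is invented.

```python
# Compass bearing from the Eiffel Tower to the camera position; the face
# of the tower looking in that direction is (roughly) the one in the picture.
import math

EIFFEL_TOWER = (48.8584, 2.2945)  # latitude, longitude in degrees


def initial_bearing(from_point, to_point):
    """Compass bearing (0-360 degrees, 0 = north) from one lat/lon point to another."""
    lat1, lon1 = map(math.radians, from_point)
    lat2, lon2 = map(math.radians, to_point)
    d_lon = lon2 - lon1
    x = math.sin(d_lon) * math.cos(lat2)
    y = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(d_lon)
    return (math.degrees(math.atan2(x, y)) + 360) % 360


def facing_side(photographer):
    bearing = initial_bearing(EIFFEL_TOWER, photographer)
    directions = ["north", "north-east", "east", "south-east",
                  "south", "south-west", "west", "north-west"]
    return directions[int((bearing + 22.5) % 360 // 45)]


# Example: a camera position on the Champ de Mars, south-east of the tower.
print(facing_side((48.8556, 2.2986)))  # -> "south-east"
```

When the EXIF also carries a compass heading for the camera (the GPSImgDirection tag, when present), the estimate gets even tighter.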
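And for the daylight idea a few points up, a deliberately crude sketch: the mean luminance of a photo as a stand-in for how much light there was. A real system would have to correct for exposure settings, white balance and scene content before such a number means anything on its own; the filenames are invented.

```python
# Mean pixel brightness as a very rough proxy for ambient light.
from PIL import Image


def mean_luminance(path):
    """Average pixel brightness on a 0-255 scale ("L" = grayscale)."""
    gray = Image.open(path).convert("L")
    histogram = gray.histogram()  # 256 counts, one per gray level
    pixels = sum(histogram)
    return sum(level * count for level, count in enumerate(histogram)) / pixels


# Very roughly: bright outdoor daylight scenes tend to land well above
# night-time shots of the same subject.
for photo in ("eiffel_noon.jpg", "eiffel_night.jpg"):
    print(photo, round(mean_luminance(photo), 1))
```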
It is quite amazing what we can derive given sufficient data: data on sound and vision, position in space (i.e. where on the earth) and time. This is just an arbitrary example that I thought of, by no means anything exceptional; if anything, it is probably quite modest. Would it be a complex problem to give an A.I. a natural motivation to understand and a drive to pursue, as opposed to a static one that is hard-coded by hand? Something like an unexplained shadow might provide a sufficient initial trigger, but some form of curiosity in the A.I. is needed to actually start looking.
Note that this is my own opinion, based on guessing, extrapolation of known applications, a bit of imagination, and my own impression of Google’s past efforts and capabilities. I have not been in contact with Google about any of this.