Disambiguating Image Queries at Google

Better Understanding Image Queries

Years ago, I wouldn’t have expected a search engine to tell a searcher about objects in a photograph or video, but search engines have been evolving and getting better at what they do.

In February, Google was granted a patent to help answer image queries: searches that involve identifying objects in photographs and videos. A search engine may have trouble understanding what a person is asking in a natural language query, and this patent focuses on disambiguating image queries.

The patent provides the following example:

For example, a user may ask a question about a photograph that the user is viewing on the computing device, such as “What is this?”

The patent tells us that the process in it may be for image queries, text queries, video queries, or any combination of those.

When a searcher asks a question about an image, a computing device may:

Capture a respective image that the user is viewing
Transcribe the question
Transmit that transcription and the image to a server

The server may receive the transcription and the image from the computing device, and:

Identify visual and textual content in the image
Generate labels for images in the content of the image, such as locations, entities, names, types of animals, etc.
Identify a particular sub-image in the image, which may be a photograph or drawing

The server may:

Identify part of a particular sub-image that may be of primary interest to a searcher, such as a historical landmark in the image
Perform image recognition on the particular sub-image to generate labels for that sub-image
Generate labels for text in the image, such as comments about the sub-image, by performing text recognition on a part of the image other than the particular sub-image
Generate a search query based on the transcription and the generated labels
Provide that query to a search engine
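To make the query generation step more concrete, here is a minimal sketch in Python of how a label might be substituted into a transcription such as “What is this?”. The function name `rewrite_query` and the list of vague terms are my own assumptions for illustration; the patent does not spell out an implementation, and a real system would be far more involved.

```python
import re

# Hypothetical list of vague words that a label might stand in for.
# This list is an assumption for illustration, not taken from the patent.
VAGUE_TERMS = ("this", "that", "it")
_VAGUE_PATTERN = re.compile(r"\b(?:" + "|".join(VAGUE_TERMS) + r")\b", re.IGNORECASE)


def rewrite_query(transcription: str, labels: list[str]) -> str:
    """Replace the first vague term in the transcription with the top label.

    `labels` is assumed to be ordered from most to least relevant.
    If there are no labels, the transcription is returned unchanged.
    """
    if not labels:
        return transcription
    # A function is used as the replacement so the label is inserted literally.
    return _VAGUE_PATTERN.sub(lambda match: labels[0], transcription, count=1)


print(rewrite_query("What is this?", ["the Eiffel Tower", "Paris"]))
```

With a label such as “the Eiffel Tower” generated from the sub-image, the vague question becomes one that a search engine can answer without seeing the image.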

The Process Behind Disambiguating a Visual Query

The process described in this patent includes:

Receiving an image presented on, or corresponding to, at least a part of a display of a computing device
Receiving a transcription of an utterance spoken by a searcher, when the image is being presented
Identifying a particular sub-image included in the image
Determining one or more first labels that show a context of the particular sub-image, based on performing image recognition on the particular sub-image
Determining one or more second labels that show the context of the particular sub-image, based on performing text recognition on a part of the image other than the particular sub-image
Generating a search query based on the transcription, the first labels, and the second labels
Providing, for output, the search query

Other aspects of performing such image query searches may involve:

Weighting a first label differently than a second label
Substituting one or more of the first labels or the second labels for terms in the transcription when generating the search query
Generating, for each of the first labels and the second labels, a label confidence score that indicates a likelihood that the label corresponds to a part of the particular sub-image that is of primary interest to the user
Selecting one or more of the first labels and second labels based on the respective label confidence scores, wherein the search query is based on the one or more selected first labels and second labels
Accessing historical query data including previous search queries provided by other users
Generating, based on the transcription, the first labels, and the second labels, one or more candidate search queries
Comparing the historical query data to the one or more candidate search queries
Selecting a search query from among the one or more candidate search queries, based on comparing the historical query data to the one or more candidate search queries
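Two of the ideas above, selecting labels by confidence score and checking candidate queries against historical query data, can be sketched briefly. This is only an illustration under my own assumptions: the threshold value, the function names, and the exact-match comparison against past queries are hypothetical, and the patent does not say how the comparison is actually done.

```python
from collections import Counter


def select_labels(scored_labels: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Keep labels whose confidence score meets the threshold, highest first.

    The 0.5 default is an arbitrary value chosen for this example.
    """
    kept = [(label, score) for label, score in scored_labels.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in kept]


def pick_by_history(candidates: list[str], historical_queries: list[str]) -> str:
    """Pick the candidate that appears most often among previous queries.

    Matching is a case-insensitive exact comparison, which is a simplifying
    assumption. Ties go to the earliest candidate in the list.
    """
    counts = Counter(query.lower() for query in historical_queries)
    return max(candidates, key=lambda candidate: counts[candidate.lower()])


labels = select_labels({"Eiffel Tower": 0.9, "Paris": 0.6, "sky": 0.2})
candidates = [f"what is {label}" for label in labels]
history = ["What is Eiffel Tower", "what is eiffel tower", "what is paris"]
print(labels)
print(pick_by_history(candidates, history))
```

The point of the comparison with historical data is that a candidate query that other searchers have actually typed is more likely to be a sensible rewrite than one nobody has ever asked.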

The method may also include:

Generating, based on the transcription, the first labels, and the second labels, one or more candidate search queries
Determining, for each of the one or more candidate search queries, a query confidence score that indicates a likelihood that the candidate search query is an accurate rewrite of the transcription
Selecting, based on the query confidence scores, a particular candidate search query as the search query
Identifying one or more images included in the image
Generating for each of the one or more images included in the image, an image confidence score that indicates a likelihood that an image is an image of primary interest to the user
Selecting the particular sub-image, based on the image confidence scores for the one or more images
Receiving data indicating a selection of a control event at the computing device, wherein the control event identifies the particular sub-image. (The computing device may capture the image and capture audio data that corresponds to the utterance in response to detecting a predefined hotword.)
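The sub-image selection described above might look something like the following sketch. I am assuming that the image confidence scores come from some upstream model, and that a control event (such as a tap on part of the screen) simply names the sub-image the searcher picked; the `SubImage` class and `select_sub_image` function are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SubImage:
    name: str
    confidence: float  # image confidence score, assumed to come from an upstream model


def select_sub_image(sub_images: list[SubImage], user_selected: Optional[str] = None) -> SubImage:
    """Return the sub-image most likely to be of primary interest.

    If a control event identified a sub-image by name, that choice wins.
    Otherwise the sub-image with the highest confidence score is returned.
    """
    if not sub_images:
        raise ValueError("no sub-images were identified in the image")
    if user_selected is not None:
        for sub_image in sub_images:
            if sub_image.name == user_selected:
                return sub_image
    return max(sub_images, key=lambda sub_image: sub_image.confidence)


found = [SubImage("landmark photo", 0.8), SubImage("profile avatar", 0.3)]
print(select_sub_image(found).name)
print(select_sub_image(found, user_selected="profile avatar").name)
```

The same pattern of scoring candidates and taking the highest would apply to choosing among candidate search queries by their query confidence scores.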

Further, the method may also include:

Receiving an additional image of the computing device and an additional transcription of an additional utterance spoken by a user of the computing device
Identifying an additional particular sub-image that is included in the additional image
Determining one or more additional first labels that indicate a context of the additional particular sub-image, based on performing image recognition on the additional particular sub-image
Determining one or more additional second labels that indicate the context of the additional particular sub-image, based on performing text recognition on a portion of the additional image other than the additional particular sub-image
Generating a command based on the additional transcription, the additional first labels, and the additional second labels, and performing the command

Performing the command can include:

Storing the additional image in memory
Storing the particular sub-image in the memory
Uploading the additional image to a server
Uploading the particular sub-image to the server
Importing the additional image to an application of the computing device
Importing the particular sub-image to the application of the computing device

The method can also include identifying metadata associated with the particular sub-image, in which case determining the one or more first labels that indicate the context of the particular sub-image is based further on that metadata.
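The patent does not describe how an utterance is turned into one of the commands listed above, so the following is purely a hypothetical sketch. It assumes a simple keyword table that maps words in the transcription to an action name, and it returns the action together with the target it applies to. The keywords, the action names, and the `generate_command` function are all my own inventions for illustration.

```python
import re
from typing import Optional

# Hypothetical mapping from spoken keywords to action names.
# None of these keywords or action names come from the patent.
COMMAND_KEYWORDS = {
    "save": "store",
    "store": "store",
    "upload": "upload",
    "import": "import",
}


def generate_command(transcription: str, target: str) -> Optional[tuple[str, str]]:
    """Return (action, target) for the first recognized keyword, or None.

    `target` names what the command applies to, for example the whole
    image or the particular sub-image.
    """
    for word in re.findall(r"[a-z]+", transcription.lower()):
        if word in COMMAND_KEYWORDS:
            return (COMMAND_KEYWORDS[word], target)
    return None


print(generate_command("Save this picture", "particular sub-image"))
print(generate_command("What is this?", "particular sub-image"))
```

When no keyword is recognized, the function returns `None`, which a caller could treat as a signal to handle the utterance as a search query instead of a command.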

Advantages of following the image queries process described in the patent can include:

The methods can determine the context of an image corresponding to a portion of a display of a computing device to aid in the processing of natural language queries
The context of the image may be determined through image and/or text recognition
The context of the image may be used to rewrite a transcription of an utterance of a user
The methods may generate labels that refer to the context of the image, and substitute the labels for portions of the transcription (such as “Where was this taken?”)
The methods may determine that the user is referring to the photo on the screen of the computing device
The methods can extract information about the photo to determine the context of the photo, as well as the context of other portions of the image that do not include the photo, such as the location where the photo was taken

This patent can be found at:

Contextually disambiguating queries
Inventors: Ibrahim Badr, Nils Grimsmo, Gokhan H. Bakir, Kamil Anikiej, Aayush Kumar, and Viacheslav Kuznetsov
Assignee: Google LLC
US Patent: 10,565,256
Granted: February 18, 2020
Filed: March 20, 2017

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for contextually disambiguating queries are disclosed. In an aspect, a method includes receiving an image being presented on a display of a computing device and a transcription of an utterance spoken by a user of the computing device, identifying a particular sub-image that is included in the image, and based on performing image recognition on the particular sub-image, determining one or more first labels that indicate a context of the particular sub-image. The method also includes, based on performing text recognition on a portion of the image other than the particular sub-image, determining one or more second labels that indicate the context of the particular sub-image, based on the transcription, the first labels, and the second labels, generating a search query, and providing, for output, the search query.
