Disambiguating Image Queries at Google

Better Understanding Image Queries

Years ago, I wouldn’t have expected a search engine to tell a searcher about objects in a photograph or video, but search engines have been evolving and getting better at what they do.

In February, Google was granted a patent on answering queries that ask about objects in photographs and videos. A search engine may have trouble understanding what a person is asking in a natural language query, and this patent focuses on disambiguating image queries.

The patent provides the following example:

For example, a user may ask a question about a photograph that the user is viewing on the computing device, such as “What is this?”

The patent tells us that the process may be used for queries about images, text, video, or any combination of those.

When a searcher asks a question about an image, the computing device may:

  • Capture the image that the user is viewing
  • Transcribe the question
  • Transmit that transcription and the image to a server

The server may receive the transcription and the image from the computing device, and:

  • Identify visual and textual content in the image
  • Generate labels for images in the content of the image, such as locations, entities, names, types of animals, etc.
  • Identify a particular sub-image in the image, which may be a photograph or drawing

The server may also:

  • Identify the part of a particular sub-image that may be of primary interest to a searcher, such as a historical landmark in the image
  • Perform image recognition on the particular sub-image to generate labels for that sub-image
  • Generate labels for text in the image, such as comments about the sub-image, by performing text recognition on a part of the image other than the particular sub-image
  • Generate a search query based on the transcription and the generated labels
  • Provide that query to a search engine
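
To make the query generation step concrete, here is a minimal sketch of one way a label could be combined with a transcription. It assumes the simplest possible approach: swapping a word like "this" for a label. The function name, the word list, and the example label are my own illustrative assumptions, not something the patent specifies.

```python
import re

# Hypothetical list of words a label may stand in for (not from the patent).
DEICTIC_WORDS = ("this", "that", "it")

def rewrite_query(transcription: str, label: str) -> str:
    """Replace the first deictic word in the transcription with a label."""
    pattern = r"\b(" + "|".join(DEICTIC_WORDS) + r")\b"
    # A function replacement keeps any backslashes in the label literal.
    rewritten = re.sub(pattern, lambda _match: label, transcription,
                       count=1, flags=re.IGNORECASE)
    # Drop the trailing question mark so the result reads like a typed query.
    return rewritten.rstrip("?").strip()

query = rewrite_query("How tall is this?", "the Eiffel Tower")
print(query)  # How tall is the Eiffel Tower
```

A real system would likely choose among several labels before rewriting, which is where the confidence scores described later come in.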

The Process Behind Disambiguating a Visual Query

The process described in this patent includes:

  • Receiving an image presented on, or corresponding to, at least a part of a display of a computing device
  • Receiving a transcription of an utterance spoken by a searcher, when the image is being presented
  • Identifying a particular sub-image included in the image
  • Determining one or more first labels that show a context of the particular sub-image, based on performing image recognition on the particular sub-image
  • Determining one or more second labels that show the context of the particular sub-image, based on performing text recognition on a part of the image other than the particular sub-image
  • Generating a search query based on the transcription, the first labels, and the second labels
  • Providing, for output, the search query
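
The steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the `Screenshot` class, the `build_query` function, and the two stub recognizers are hypothetical stand-ins for real image and text recognition, and appending labels to the question is just one naive way to form a query.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Screenshot:
    sub_image: str    # stand-in for the pixels of the chosen sub-image
    surrounding: str  # stand-in for the rest of the captured image

def build_query(
    shot: Screenshot,
    transcription: str,
    recognize_image: Callable[[str], List[str]],
    recognize_text: Callable[[str], List[str]],
) -> str:
    # First labels: image recognition on the particular sub-image.
    first_labels = recognize_image(shot.sub_image)
    # Second labels: text recognition on the part of the image outside it.
    second_labels = recognize_text(shot.surrounding)
    # Naive combination (an assumption): append de-duplicated labels.
    labels = list(dict.fromkeys(first_labels + second_labels))
    return " ".join([transcription.rstrip("?")] + labels)

# Stub recognizers with canned results, for illustration only.
def fake_image_recognizer(_pixels: str) -> List[str]:
    return ["Eiffel Tower"]

def fake_text_recognizer(_pixels: str) -> List[str]:
    return ["Paris", "Eiffel Tower"]

shot = Screenshot(sub_image="photo-bytes", surrounding="comment-bytes")
query = build_query(shot, "What is this?", fake_image_recognizer, fake_text_recognizer)
print(query)  # What is this Eiffel Tower Paris
```

Passing the recognizers in as functions keeps the sketch runnable without any vision library while still showing which part of the image feeds which set of labels.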

Other aspects of performing such image query searches may involve:

  • Weighting a first label differently than a second label
  • Substituting one or more of the first labels or the second labels for terms in the transcription when generating the search query
  • Generating, for each of the first labels and the second labels, a label confidence score that indicates a likelihood that the label corresponds to a part of the particular sub-image that is of primary interest to the user
  • Selecting one or more of the first labels and second labels based on the respective label confidence scores, wherein the search query is based on the one or more selected first labels and second labels
  • Accessing historical query data including previous search queries provided by other users
  • Generating, based on the transcription, the first labels, and the second labels, one or more candidate search queries
  • Comparing the historical query data to the one or more candidate search queries
  • Selecting a search query from among the one or more candidate search queries, based on comparing the historical query data to the one or more candidate search queries
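
Here is a rough sketch of how label confidence scores and historical query data might work together. The threshold, the scores, and the word-overlap comparison are all assumptions made for illustration; the patent does not say how the comparison against historical queries is computed.

```python
from typing import Dict, List

def select_labels(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep labels whose confidence meets the threshold, highest first."""
    kept = [label for label, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda label: scores[label], reverse=True)

def pick_by_history(candidates: List[str], history: List[str]) -> str:
    """Pick the candidate sharing the most words with any past query."""
    def overlap(candidate: str) -> int:
        words = set(candidate.lower().split())
        return max(
            (len(words & set(past.lower().split())) for past in history),
            default=0,
        )
    return max(candidates, key=overlap)

# Hypothetical scores and history, invented for this example.
label_scores = {"Eiffel Tower": 0.9, "Paris": 0.7, "sky": 0.2}
labels = select_labels(label_scores)
candidates = [f"how tall is {label}" for label in labels]
history = ["how tall is the eiffel tower", "eiffel tower height"]

best = pick_by_history(candidates, history)
print(best)  # how tall is Eiffel Tower
```

The low-scoring "sky" label never becomes a candidate, and the remaining candidates are ranked by how closely they resemble queries other people have already asked.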

The method may also include:

  • Generating, based on the transcription, the first labels, and the second labels, one or more candidate search queries
  • Determining, for each of the one or more candidate search queries, a query confidence score that indicates a likelihood that the candidate search query is an accurate rewrite of the transcription
  • Selecting, based on the query confidence scores, a particular candidate search query as the search query
  • Identifying one or more images included in the image
  • Generating for each of the one or more images included in the image, an image confidence score that indicates a likelihood that an image is an image of primary interest to the user
  • Selecting the particular sub-image, based on the image confidence scores for the one or more images
  • Receiving data indicating a selection of a control event at the computing device, wherein the control event identifies the particular sub-image. (The computing device may capture the image and capture audio data that corresponds to the utterance in response to detecting a predefined hotword.)
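
Choosing the particular sub-image could look something like the sketch below. It assumes each detected image already has a confidence score, and that a control event (such as the searcher tapping a photo) overrides the scores when present. The class, function, and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SubImage:
    name: str
    confidence: float  # hypothetical image confidence score between 0 and 1

def choose_sub_image(
    candidates: List[SubImage],
    user_selected: Optional[str] = None,
) -> SubImage:
    """Prefer an explicit user selection, else take the highest score."""
    if user_selected is not None:
        for candidate in candidates:
            if candidate.name == user_selected:
                return candidate
    return max(candidates, key=lambda candidate: candidate.confidence)

detected = [
    SubImage("profile avatar", 0.15),
    SubImage("landmark photo", 0.85),
]

print(choose_sub_image(detected).name)                    # landmark photo
print(choose_sub_image(detected, "profile avatar").name)  # profile avatar
```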

Further, the method may also include:

  • Receiving an additional image of the computing device and an additional transcription of an additional utterance spoken by a user of the computing device
  • Identifying an additional particular sub-image that is included in the additional image
  • Determining one or more additional first labels that indicate a context of the additional particular sub-image, based on performing image recognition on the additional particular sub-image
  • Determining one or more additional second labels that indicate the context of the additional particular sub-image, based on performing text recognition on a portion of the additional image other than the additional particular sub-image
  • Generating a command based on the additional transcription, the additional first labels, and the additional second labels, and performing the command

Performing the command can include:

  • Storing the additional image in memory
  • Storing the particular sub-image in the memory
  • Uploading the additional image to a server
  • Uploading the particular sub-image to the server
  • Importing the additional image to an application of the computing device
  • Importing the particular sub-image to the application of the computing device
  • Identifying metadata associated with the particular sub-image, where determining the one or more first labels that indicate the context of the particular sub-image is based further on that metadata

Advantages of following the image query process described in the patent can include:

  • The methods can determine the context of an image corresponding to a portion of a display of a computing device to aid in the processing of natural language queries
  • The context of the image may be determined through image and/or text recognition
  • The context of the image may be used to rewrite a transcription of an utterance of a user
  • The methods may generate labels that refer to the context of the image, and substitute the labels for portions of the transcription, such as “Where was this taken?”
  • The methods may determine that the user is referring to the photo on the screen of the computing device
  • The methods can extract information about the photo to determine the context of the photo, as well as the context of other portions of the image that do not include the photo, such as the location where the photo was taken

This patent can be found at:

Contextually disambiguating queries
Inventors: Ibrahim Badr, Nils Grimsmo, Gokhan H. Bakir, Kamil Anikiej, Aayush Kumar, and Viacheslav Kuznetsov
Assignee: Google LLC
US Patent: 10,565,256
Granted: February 18, 2020
Filed: March 20, 2017

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for contextually disambiguating queries are disclosed. In an aspect, a method includes receiving an image being presented on a display of a computing device and a transcription of an utterance spoken by a user of the computing device, identifying a particular sub-image that is included in the image, and based on performing image recognition on the particular sub-image, determining one or more first labels that indicate a context of the particular sub-image. The method also includes, based on performing text recognition on a portion of the image other than the particular sub-image, determining one or more second labels that indicate the context of the particular sub-image, based on the transcription, the first labels, and the second labels, generating a search query, and providing, for output, the search query.
