An introduction to the role of entities in information retrieval systems
The concept of entities has recently gained prominence within SEO, as Google’s algorithms and information retrieval systems appear to give them greater significance and weighting.
The concept of entities in SEO, however, isn’t new. Key early papers and research into entity recognition include:
- Using Tf-Idf to Determine Word Relevance in Document Queries, J. Ramos 2003
- Weakly-supervised discovery of named entities using web search queries, Pasca 2007
- Query-drift prevention for robust query expansion, Zighelnic & Kurland 2008
- Named entity recognition in query, J. Guo, G. Xu, X. Cheng, and H. Li 2009
Hummingbird, RankBrain, and Google’s wider machine learning systems are examples of known Google components that help detect entities within search queries and better associate them with documents stored within Google’s many shards.
Most current articles on entity optimisation focus on exactly that: optimising for, and often proposing to exploit, specific entities and their relationships with user queries.
For example, we can infer that a user searching for citrus fruits may potentially find value in documents talking about oranges, lemons, and grapefruits.
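The citrus fruits example can be sketched in a few lines of Python. This is only an illustration: the `RELATED_ENTITIES` mapping and the `expand_query` function are hypothetical names, and the relationships are hand-written here, whereas a real search engine would draw them from a much larger knowledge base.

```python
# Hypothetical, hand-written entity relationships for illustration only.
RELATED_ENTITIES = {
    "citrus fruits": ["oranges", "lemons", "grapefruits"],
}


def expand_query(query: str) -> list[str]:
    """Return the normalised query followed by any related entities."""
    key = query.strip().lower()
    # Unknown queries are returned unchanged (no expansion).
    return [key] + RELATED_ENTITIES.get(key, [])


if __name__ == "__main__":
    print(expand_query("Citrus Fruits"))
    print(expand_query("apples"))
```

A query with no known relationships comes back as a single-item list, so documents are only pulled in through entity relationships the system actually knows about.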
What is an entity
An entity can be a person, a physical place, an organisation, or a miscellaneous invention of human thought processes such as laws, religion, or currency.
The key criterion is that the entity needs to have a distinct and separate existence.
In web-based search, it makes sense for search engines (which are effectively advanced information retrieval systems) to work with entities.
By associating entities with user keywords, the search engine is more likely to return a diverse set of results that matches a number of user intents than if it took the search query verbatim and didn’t expand beyond it.
One way to think about this, and to better introduce the concept to clients, is to recognise and describe Google’s Knowledge Graph as an “entity graph”.
This is one of Google’s forays into using semantic/entity search at scale.
This shift has been branded as “AEO” (Answer Engine Optimisation) as a way to explain the change in information retrieval methods. In reality, search engines have always been focused on relaying relevant answers to specific search queries; they’ve just gotten better at it.
An entity-based search system
There are three essential components in any information retrieval system:
- A document model.
- A query model.
- A retrieval model (that matches the document model against the query model).
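As a rough sketch of how these three components fit together, the Python below models documents and queries as sets of entities and ranks documents by how many entities they share with the query. The class names, the `retrieve` function, and the overlap-count scoring are all assumptions made for illustration; they are not a description of how any real search engine scores documents.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Document model: an identifier plus the entities found in the text."""
    doc_id: str
    entities: set[str]


@dataclass
class Query:
    """Query model: the raw text plus the entities recognised in it."""
    text: str
    entities: set[str]


def retrieve(query: Query, documents: list[Document]) -> list[tuple[str, int]]:
    """Retrieval model: rank documents by the number of shared entities."""
    scored = [(d.doc_id, len(query.entities & d.entities)) for d in documents]
    # Drop documents that share no entities with the query.
    scored = [pair for pair in scored if pair[1] > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = [
        Document("a", {"chelsea football club", "premier league", "london"}),
        Document("b", {"pyotr ilyich tchaikovsky", "russia"}),
        Document("c", {"london", "england"}),
    ]
    q = Query("chelsea football club london", {"chelsea football club", "london"})
    print(retrieve(q, docs))
```

Here document "a" shares two entities with the query and "c" shares one, so they are returned in that order, while "b" is dropped.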
Given a query, an entity-based search system would identify the entities mentioned within relevant documents.
Below are some examples of entity recognition within documents:
| Search Query | Wikipedia Intro (Entities Bolded) |
| --- | --- |
| [chelsea football club] | Chelsea Football Club is a professional football club in Chelsea, London, England, that competes in the Premier League, the highest tier of English football. |
| [pyotr ilyich tchaikovsky] | Pyotr Ilyich Tchaikovsky was a Russian composer of the romantic period, whose works are among the most popular music in the classical repertoire. |
| [djibouti] | Djibouti, on the Horn of Africa, is a mostly French- and Arabic-speaking country of dry shrublands, volcanic formations and Gulf of Aden beaches. |
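One simple way to illustrate entity recognition of this kind is a dictionary lookup that prefers the longest matching phrase. The sketch below assumes a small hand-written set of known entities (`KNOWN_ENTITIES`) and a hypothetical `find_entities` function. Production systems use far richer methods, so treat this only as a toy model of the idea.

```python
import re

# Hypothetical entity dictionary; a real system would use a knowledge base.
KNOWN_ENTITIES = {
    "chelsea football club",
    "chelsea",
    "london",
    "england",
    "premier league",
}


def find_entities(text: str, known: set[str], max_len: int = 4) -> list[str]:
    """Scan left to right, preferring the longest known phrase at each position."""
    tokens = re.findall(r"\w+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in known:
                found.append(candidate)
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no entity starts at this token
    return found


if __name__ == "__main__":
    intro = (
        "Chelsea Football Club is a professional football club in Chelsea, "
        "London, England, that competes in the Premier League."
    )
    print(find_entities(intro, KNOWN_ENTITIES))
```

Because longer phrases are tried first, the opening "Chelsea Football Club" is recognised as one entity, while the later standalone "Chelsea" is recognised separately as the place.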
The query drift problem
Studies have found that query expansion can yield an adequate result set on average, but that for some queries its performance can be inferior to that of using the original search query alone. This is known as the query drift problem.
For example, if you search for [silent films] you will likely get search results discussing “the best silent films”, actors, directors, and potentially some commercial listings.
Getting results for Charlie Chaplin would also be appropriate for the query.
However, in a “classical” query expansion system, the search phrase [charlie chaplin] would be “added” to the original query of [silent films] as two keywords, [charlie] and [chaplin].
Because of this, systems using a classic query expansion model will check documents for [charlie], [chaplin], and [silent films] without respecting that [charlie chaplin] in itself is an entity.
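The difference can be sketched with a toy example. The code below is an assumption-laden illustration: the function names, the two sample documents, and the simple "match any term" rule are all hypothetical, chosen only to show how splitting an entity into separate keywords can pull in unrelated documents.

```python
def classic_expansion(query: str, expansion: str) -> list[str]:
    """Add the expansion phrase as individual keywords."""
    return [query] + expansion.split()


def entity_aware_expansion(query: str, expansion: str, known: set[str]) -> list[str]:
    """Keep the expansion phrase whole when it is a known entity."""
    if expansion in known:
        return [query, expansion]
    return [query] + expansion.split()


def matching_docs(terms: list[str], docs: dict[str, str]) -> list[str]:
    """Return ids of documents containing any term as a whole word or phrase."""
    hits = []
    for doc_id, text in docs.items():
        padded = f" {text.lower()} "
        if any(f" {term} " in padded for term in terms):
            hits.append(doc_id)
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical documents and entity list for illustration.
    docs = {
        "d1": "charlie chaplin starred in many silent films",
        "d2": "charlie brown is a peanuts character",
    }
    known = {"charlie chaplin"}
    classic = classic_expansion("silent films", "charlie chaplin")
    aware = entity_aware_expansion("silent films", "charlie chaplin", known)
    print(classic, matching_docs(classic, docs))
    print(aware, matching_docs(aware, docs))
```

With the classic approach, the lone keyword "charlie" also matches the document about Charlie Brown, which has nothing to do with silent films. Treating "charlie chaplin" as a single entity avoids that match.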