Latent Dirichlet Allocation (LDA) Explained Simply
You might have heard the term “semantic analysis”, “latent semantic analysis” or even the terrifying “Latent Dirichlet Allocation” (LDA) batted around in conversations with SEO company reps, or by SEO newbies who want to sound smart. It’s a moderately complex topic that’s strikingly undocumented from an SEO perspective, so here’s a simple explanation:
Latent Dirichlet Allocation is a model that determines the meaning of a word or document by looking at words that appear nearby.
There you go! Not so bad, is it? The complex part comes in when we start looking at how the search engines use LDA in their algorithms.
How Google Knows What You Mean
In order to better determine user intent, search engines have to know the context that a search is performed in; a search for Ice (the jewelry store) and a search for ice (the solid form of two parts hydrogen, one part oxygen) are not the same thing, and you probably wouldn’t want images of earrings and bracelets if you’re just looking for a way to make completely transparent ice cubes (Protip: Boil the water first).
So how does Google know the difference? By evaluating the words as a unit, the engines can infer that when “ice” appears next to words like ring, gift, Valentine’s and so on, it’s probably referring to the jewelry store or the colloquial term, whereas an appearance by freeze, cold, north, kitchen and so on would lean more towards the frozen H2O side of things.
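If you like seeing ideas in code, here’s a toy sketch of that co-occurrence intuition in Python. To be clear, the sample sentences, word lists, and raw-count scoring are all made up for illustration; this is a back-of-the-napkin version of the idea, not anything resembling what Google actually runs.

```python
from collections import Counter

# Toy "documents" for each sense of the word "ice" (made up for illustration).
jewelry_docs = [
    "ice ring gift valentine diamond",
    "ice earrings gift bracelet",
]
frozen_docs = [
    "ice freeze cold kitchen cube",
    "ice cold north freeze water",
]

def context_counts(docs):
    """Count the words that appear alongside 'ice' in each toy document."""
    counts = Counter()
    for doc in docs:
        counts.update(w for w in doc.split() if w != "ice")
    return counts

jewelry_ctx = context_counts(jewelry_docs)
frozen_ctx = context_counts(frozen_docs)

def score(query, ctx):
    """Score a query by how often its words co-occur with 'ice' in a sense."""
    return sum(ctx[w] for w in query.split())

query = "ice gift for valentine"
print("jewelry score:", score(query, jewelry_ctx))  # 3 -> leans jewelry
print("frozen score:", score(query, frozen_ctx))    # 0 -> not frozen water
```

Real systems work over billions of documents and use proper probability distributions rather than raw counts, but the intuition is the same: the words around “ice” tell you which “ice” is meant.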
If that’s all you came here for, great. Go tweet this article or subscribe to the email list for more stuff. If you really want to get into the meat of LDA, read on.
Deeper into Latent Dirichlet Allocation
LDA is a probabilistic model. Through machine learning, a program can accrue knowledge of words and their meanings by estimating the probability that a word or phrase will appear near other specific words. Here’s an oddly appropriate example:
One of the leading researchers into Latent Dirichlet Allocation is a man named Michael Jordan. No, not the basketball player, but the fact that you went right to that is exactly what Google is counting on. Unfortunately, given the sheer volume of mentions of the name “Michael Jordan” and its appearance adjacent to words like basketball, sports, Nike, and so on, there is almost no way the Wikipedia page or the scholarly papers by Michael I. Jordan will rank highly in the SERPs for “Michael Jordan.” Until you add a clarifying word or phrase like machine learning, researcher, or artificial intelligence, you won’t find many mentions of this man under a search for just his name.
A document’s topic is inferred from the contents of the document and how those contents relate to each other. As the algorithm gathers more data on keywords and where they appear, results become more statistically relevant. Search queries for “Michael Jordan” tend to return more basketball-related results unless joined by a modifier, like “researcher,” that changes the semantic meaning of “Michael Jordan.”
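If you want to poke at an actual LDA implementation, here’s a minimal sketch using scikit-learn’s LatentDirichletAllocation on a tiny corpus I made up for this example. With only four toy documents the topics won’t separate as cleanly as they would at web scale, but it shows the mechanics: the model groups co-occurring words into topics, and a modifier like “researcher” shifts which topic a new query leans towards.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A made-up four-document corpus mixing the two "Michael Jordan" senses.
docs = [
    "michael jordan basketball nba bulls dunk championship",
    "michael jordan nike basketball stats points career",
    "michael jordan machine learning research statistics berkeley",
    "michael jordan researcher artificial intelligence graphical models",
]

# Turn the documents into word counts, then fit a two-topic LDA model.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics)  # each row is a document's mix over the two topics

# Show the top words the model has grouped into each inferred topic.
words = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {topic_idx}: {top}")

# Infer the topic mix for a new query; the modifier shifts the distribution.
new_query = vectorizer.transform(["michael jordan researcher statistics"])
print(lda.transform(new_query))  # probability of each topic for the query
```

The web-scale version of this has vastly more topics and documents, but the principle holds: modifiers drag a query’s topic distribution towards one sense or the other.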
Why Keyword Density Doesn’t Matter
This model of identifying, processing, and assigning meanings to words and phrases allows search engines to return results that more accurately match a user’s query, based on the inferred meaning. This also means that a result doesn’t have to match a searcher’s wording exactly.
Since “Michael Jordan” shows up so frequently near “basketball,” a search for “Michael Jordan stats” will return results for the basketball player, rather than the researcher, even though Michael Jordan the researcher was a huge force in the field of statistics.
Continuing the example, “Michael Jordan stats” returns a page titled “Michael Jordan Career Stats,” showing a bit of intent matching on the side of Google.
This development is one of the key reasons that keyword density is completely irrelevant; stuffing a page full of exact-match keywords does you no good, since search engines can infer the meaning of the searcher’s query without relying on explicit text matching. And because the engines can interpret (to some extent) the meaning of phrases to match underlying intent, they’re able to extend their reach to analyze the quality of the document; if your article is stuffed with manufactured keywords that feel unnatural, you’re going to be hurting for rankings.
It also means that alternate word forms or orders often don’t matter when it comes to searches. You’ll notice that searches for “Michael Jordan” and “Jordan Michael” return very similar results; Google knows what you’re looking for, and is giving you the most relevant results for your question, instead of matching the search exactly.
With that said, it’s important to understand that there’s still a heavy bias towards exact text matching in results; exact-match domains are very powerful for capturing traffic for specific phrases, but Latent Dirichlet Allocation is one of the biggest steps towards semantic analysis and machine understanding, and away from raw text-matching.
So the next time you’re trying to decide whether you should use “fluorescent lightbulbs for sale” or “fluorescent lightbulb sale” in a sentence, relax and write naturally. Google knows what you mean.
Latent Dirichlet Allocation Doesn’t Solve Everything
As you may have guessed, this can lead to some problems when the engine guesses wrong, or the document’s wording is confusing. It can be particularly burdensome for writers who favor wordplay and clever titles.
If you’re writing an article for Rolling Stone ABOUT the Rolling Stones and you lead off with a clever introduction about rolling stones gathering no moss or how they’re “precious stones” or something like that, the engines are going to have a hard time sifting out the precise meaning of your article because of the multiple senses of the word “stones” suggested by the words you’ve placed near them. I’m not sure about this particular example, and whether the engines would recognize the proper nouns and be able to distinguish the band from the magazine, but the theory is there.