Anthony Gentile

Search autosuggest and n-grams

January 25, 2011, 1:32pm

Having an interest in computational linguistics, I'd like to spend some time explaining an example use that we see a lot. Oftentimes when browsing the web we use search forms to find what we are looking for, and many of those forms try to help us along by automatically suggesting common search terms as we type. If you are searching an online shop you may only see suggestions for items available for purchase, whereas on Google even very obscure terms get suggestions. What exactly is going on here, and how can you make your autosuggest more robust and relevant?

Wildcard searching

Many times the autosuggest you run into within a search form is driven by the content that has been indexed or stored in the database for that site. Let's say I have an online store and I use Solr, Sphinx, or some other search tool to index the titles of the items available for sale on my site. When people use my search form, I perform a wildcard search for the term, e.g. TERM* OR TERM. If I search for cook I would get results for cookbook, cookies, and so forth, provided those items are in my index. So now I simply add some JavaScript to my search box that says: after the user has typed at least three characters and has paused typing for a certain interval, perform an AJAX call that searches on whatever the user has provided so far. When the response comes back, I display it in a nice box below the search input. With this implementation I am providing suggestions relevant to the possible results. Better search tools also allow me to assign weights and/or orderings to my index, so that cookies will come before cookbook in the autosuggest, or whichever item sells the most will come up first.
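
To make that concrete, here is a minimal sketch of what the server side of that AJAX call might look like, assuming a plain Postgres table of product titles rather than Solr or Sphinx; the table name, column name, and connection details are made up for illustration:

<?php
// Hypothetical autosuggest endpoint: the JavaScript on the search box calls this
// with the characters typed so far and renders the returned JSON in a box below the input.
$pdo  = new PDO('pgsql:host=localhost;dbname=shop', 'user', 'pass');
$term = isset($_GET['term']) ? $_GET['term'] : '';

if (strlen($term) >= 3) {
    // Prefix (wildcard) match: anything whose title starts with what was typed.
    $stmt = $pdo->prepare('SELECT title FROM products WHERE title ILIKE ? ORDER BY title LIMIT 10');
    $stmt->execute(array($term . '%'));
    $suggestions = $stmt->fetchAll(PDO::FETCH_COLUMN);
} else {
    // Too short to be useful; return nothing rather than the whole catalog.
    $suggestions = array();
}

header('Content-Type: application/json');
echo json_encode($suggestions);

A real search tool would let you swap the query for one that also applies weights or sales-based ordering, but the request/response shape stays the same.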

N-grams

So how do I go beyond what I have indexed for autosuggestion? How would I provide the suggestion collaborate when someone types in collaberate, given that the previous example of wildcard searching would fail at the first e? Well, let's talk about n-grams. If you have never heard of n-grams, think of them as a fixed-length window running across a word, where the length of the window is n. For example, let's look at the trigrams (n = 3) of the word 'dog'.

'  d', ' do', 'dog', 'og ', 'g  '
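
Generating those character trigrams only takes a few lines of code. Here is a quick sketch in PHP; the function name and the padding convention (n - 1 spaces on both ends, matching the list above) are my own choices for illustration:

<?php
// Extract character n-grams from a word; a simple illustrative helper, not a library function.
function char_ngrams($word, $n = 3)
{
    // Pad both ends with n - 1 spaces so the edge characters get their own n-grams.
    $padded = str_repeat(' ', $n - 1) . $word . str_repeat(' ', $n - 1);
    $grams  = array();
    for ($i = 0, $len = strlen($padded); $i <= $len - $n; $i++) {
        $grams[] = substr($padded, $i, $n);
    }
    return $grams;
}

var_dump(char_ngrams('dog'));
// Produces the five trigrams above: '  d', ' do', 'dog', 'og ', 'g  '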

Using n-grams and some fairly simple math, we can rank a term against other terms in a set to find the most similar ones: roughly speaking, the more trigrams two strings share relative to the total number of distinct trigrams between them, the more similar they are. Let me give you an example in code. PostgreSQL provides a trigram module, pg_trgm. I am going to create a database table containing the words of an English dictionary and then have Postgres tell me which words are similar based on the trigram ranking.

Following an example from my co-worker, Hubert Lubaczewski, I add a GiST index using pg_trgm's gist_trgm_ops operator class so the similarity lookups stay fast (the pg_trgm contrib module needs to be installed first):

CREATE TABLE english_words ( word text );
-- load a word list into english_words here, e.g. by COPYing in a dictionary file
CREATE INDEX trgm_idx ON english_words USING gist (word gist_trgm_ops);

SELECT word, similarity(word, 'collaber') AS rank
FROM english_words
WHERE word % 'collaber'
ORDER BY rank DESC, word
LIMIT 10;


    word     |   rank
-------------+----------
 collar      | 0.454545
 collage     | 0.416667
 collars     | 0.416667
 collate     | 0.416667
 collier     | 0.416667
 Collier     | 0.416667
 collaborate |      0.4
 collages    | 0.384615
 collapse    | 0.384615
 collared    | 0.384615
(10 rows)

You can see how beneficial n-grams are for autosuggest and autocompletion. Used in conjunction with our indexed search data, they can be very effective in helping users find what they are looking for. Solr and Sphinx users will be happy to know that n-gram support is available for both, which makes for some rather powerful search tools: you can combine weights, orderings, and n-gram rankings to give users the best possible suggestions and results.
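
As a rough illustration of combining the two approaches, here is a fallback sketch of my own (written with plain PDO against the english_words table rather than Solr or Sphinx; the connection details are made up): try the prefix match first, and only fall back to trigram similarity when nothing starts with what the user typed.

<?php
// Illustrative fallback strategy: prefix match first, trigram similarity second.
$pdo  = new PDO('pgsql:host=localhost;dbname=shop', 'user', 'pass');
$term = 'collaber';

// Step 1: the wildcard-style prefix search from earlier.
$stmt = $pdo->prepare('SELECT word FROM english_words WHERE word ILIKE ? ORDER BY word LIMIT 10');
$stmt->execute(array($term . '%'));
$suggestions = $stmt->fetchAll(PDO::FETCH_COLUMN);

// Step 2: nothing in the dictionary starts with 'collaber', so rank by trigram similarity instead.
if (!$suggestions) {
    $stmt = $pdo->prepare(
        'SELECT word FROM english_words
          WHERE word % ?
          ORDER BY similarity(word, ?) DESC, word
          LIMIT 10'
    );
    $stmt->execute(array($term, $term));
    $suggestions = $stmt->fetchAll(PDO::FETCH_COLUMN);
}

print_r($suggestions); // collaborate now appears even though the prefix match found nothing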

Google, Bing and many others already use n-grams and have provided their most frequently searched terms as publicly available web corpora. Bing has an n-gram web API available for use. Below is an example using PHP:

/*
http://web-ngram.research.microsoft.com/info/

bing-title = the index I am working with (bing-anchor, bing-body, bing-title, bing-query)
apr10 = time of index
3 = n-gram size (trigram)

u = my api key
format = (json,text,xml)
p = phrase (php developer)
n = result count
*/


$data = file_get_contents('http://web-ngram.research.microsoft.com/rest/lookup.svc/bing-title/apr10/3/gen?u=my-api-key&format=json&p=php+developer&n=5');
var_dump(json_decode($data));

/*
object(stdClass)#1 (4) {
  ["backoff"]=>
  float(-1.76163423)
  ["probabilities"]=>
  array(5) {
    [0]=>
    float(-0.647157848)
    [1]=>
    float(-0.7836181)
    [2]=>
    float(-0.9755778)
    [3]=>
    float(-1.24088967)
    [4]=>
    float(-1.49609017)
  }
  ["words"]=>
  array(5) {
    [0]=>
    string(1) "s"
    [1]=>
    string(3) "asp"
    [2]=>
    string(5) "tool"
    [3]=>
    string(4) ""
    [4]=>
    string(4) "jobs"
  }
}
*/

From the output we can see the words that are most likely to follow our phrase 'php developer'. The values in probabilities are log probabilities, so the closer a value is to zero, the more likely the corresponding word.

Helpful additions

It should be noted that many full text search tools and databases provide a lot of other helpful packages that solve problems like these in various ways, so it is worth checking what your tools already offer before rolling your own solution.

I hope this article has helped shed some light on autosuggestion and shown a way in which computational linguistics is applied to the web.