Using Elasticsearch Completion Suggesters for Address Autosuggest

Here at REA we have implemented a street address autosuggest system using Elasticsearch’s Completion Suggester feature.  This turned out to be much more interesting and more challenging than expected, and so I thought I should share some of what we learnt along the way.

On the realestate.com.au website, we provide a web page for every property in Australia (more than eleven million of them) that includes the property’s current estimated value, its sale and rental history, and various other useful details. The site suggests completions of addresses as you type them.

For example, if you type

“511 churc”

our autosuggest will suggest various addresses, such as

“511 Church Street Richmond Vic 3121”

[Figure: Address search autosuggest. Typing "511 church" suggests "511 Church St, Richmond", "511 Church Ave, Sandy Bay", ...]

Elasticsearch completion suggesters work a little differently to normal Elasticsearch/Lucene inverted indices. They are designed to support the prefix matching required by autocompletion more efficiently than the inverted indexes used for normal queries.

To use the completion suggesters, you need to use the special “completion” field type in your index mapping. For example, the following request will create a completion field named “suggest” within a document type called “address” within an index called “address”. Note that these examples work on Elasticsearch 2, but may need some modifications to work on more recent versions.

curl -XPUT "http://localhost:9200/address" -d'
{
  "mappings": {
    "address": {
      "properties": {
        "suggest": {
          "type": "completion"
        }
      }
    }
  }
}'

You can index your suggestions as you would normal Elasticsearch documents:

curl -XPOST "http://localhost:9200/address/address" -d'
{
  "suggest" : "511 Church St, Richmond, Vic 3121"
}'

However, if you wish to attach additional information to the suggestion text, then you need to use a longer form. For example, to specify “weights” to be used when ranking the suggestions, your document might look like this:

curl -XPOST "http://localhost:9200/address/address" -d'
{
  "suggest" : {
    "input" : "511 Church St, Richmond, Vic 3121",
    "weight" : 1234
  }
}'

To request suggestions, you can either specify a “suggest” section within a normal search request, or use the convenient _suggest end-point. You can make multiple suggestion queries within a single request, so each query must be given a name. Here is an example of a query for completions of the text “511 Chur”:

curl -XGET "http://localhost:9200/address/_suggest?pretty=true" -d'
{
  "example-suggest" : {
    "text" : "511 Chur",
    "completion" : {
      "field" : "suggest"
    }
  }
}'

The response will contain a list of the top-ranked matching suggestions (five by default; the “size” option requests more).
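As a rough illustration of consuming that response, the Python sketch below pulls the suggested texts out of a response body. The sample response is hand-written to mirror the shape that Elasticsearch 2 returns from _suggest (one list of entries per named suggestion, each entry carrying an “options” list); it was not captured from a real cluster, and the function name suggestion_texts is our own invention.

```python
import json

# Hand-written sample in the shape of an Elasticsearch 2 _suggest response.
# Illustrative only: not captured from a real cluster.
SAMPLE_RESPONSE = json.loads("""
{
  "_shards": {"total": 5, "successful": 5, "failed": 0},
  "example-suggest": [
    {
      "text": "511 Chur",
      "offset": 0,
      "length": 8,
      "options": [
        {"text": "511 Church St, Richmond, Vic 3121", "score": 1234.0}
      ]
    }
  ]
}
""")


def suggestion_texts(response, name):
    """Return the suggested strings for the named suggestion, in ranked order."""
    texts = []
    for entry in response.get(name, []):
        for option in entry.get("options", []):
            texts.append(option["text"])
    return texts


if __name__ == "__main__":
    print(suggestion_texts(SAMPLE_RESPONSE, "example-suggest"))
```

The name passed to suggestion_texts must match the name given to the query, which is “example-suggest” in the request above.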

Matching on different inputs

One of the limitations of completion suggesters is that the matching of user input against suggestions is always strictly from the beginning of the suggestion. For example, the query “brown fox” will not match “quick brown fox”.  Sometimes, a user will type something like “3 Smith St” when the address they are looking for is actually “3-5 Smith St”.  To accommodate this, we take advantage of the fact that completion suggesters allow you to specify multiple variations against which the user’s input is to be matched. These are specified as an array of values in an “input” field of the document. When suggestions are requested, Elasticsearch responds with the contents of a separate “output” field.

So, for example, to index an address like:

5/1-3 Abby Court, West Moonah, Tas 7009

The corresponding document indexing request looks like this:

curl -XPOST "http://localhost:9200/address/address" -d'
{
  "suggest": {
    "output": "5/1-3 Abby Court, West Moonah, Tas 7009",
    "input": [
      "5/1-3 Abby Court, West Moonah, Tas 7009",
      "Abby Court, West Moonah, Tas 7009",
      "1-3 Abby Court, West Moonah, Tas 7009",
      "1 Abby Court, West Moonah, Tas 7009",
      "3 Abby Court, West Moonah, Tas 7009",
      "5/1 Abby Court, West Moonah, Tas 7009",
      "5/3 Abby Court, West Moonah, Tas 7009"
    ]
  }
}'

Providing multiple “inputs” in this manner ensures that, for example, the address will be suggested even when only the street is entered.
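The list of inputs can be generated mechanically from the parts of an address. The Python sketch below is one way this might be done, not our production code: the function names and the split of an address into unit, street number (or range), and remainder are assumptions made for illustration.

```python
def address_inputs(unit, street_numbers, remainder):
    """Build the "input" variations for one address.

    unit           -- unit number as a string, or None if there is no unit
    street_numbers -- street number or range, e.g. "511" or "1-3"
    remainder      -- the rest of the address, starting at the street name
    """
    parts = street_numbers.split("-")
    # For a range such as "1-3", each end of the range is also a valid input.
    singles = parts if len(parts) > 1 else []

    inputs = []
    if unit is not None:
        inputs.append(f"{unit}/{street_numbers} {remainder}")
    inputs.append(remainder)  # street name only
    inputs.append(f"{street_numbers} {remainder}")
    inputs.extend(f"{n} {remainder}" for n in singles)
    if unit is not None:
        inputs.extend(f"{unit}/{n} {remainder}" for n in singles)
    return inputs


def suggest_document(unit, street_numbers, remainder):
    """Build an Elasticsearch 2 style document with "output" and "input"."""
    prefix = f"{unit}/{street_numbers}" if unit is not None else street_numbers
    return {
        "suggest": {
            "output": f"{prefix} {remainder}",
            "input": address_inputs(unit, street_numbers, remainder),
        }
    }


if __name__ == "__main__":
    doc = suggest_document("5", "1-3", "Abby Court, West Moonah, Tas 7009")
    for text in doc["suggest"]["input"]:
        print(text)
```

Called with the Abby Court address, this produces the same seven inputs as the indexing request above.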

[Figure: Address completion from street name. Typing "Abby Ct, West Moo" expands to "1-3 Abby Ct", etc.]

NOTE: Since Elasticsearch version 5, the “output” field is no longer supported.  Instead, completion suggesters in Elasticsearch 5 return the entire document containing the suggestion, and so the “output” and “payload” fields can now be placed elsewhere in the document (in fact they have to be placed elsewhere).

Fuzzy Matching

Completion suggesters support matching with minor misspellings so that, for example, “511 chorch st” matches “511 Church St”. This is achieved by switching on “fuzzy matching” at query time.

All else being equal, the service should list exact matches before listing fuzzy matches. For example, when the user types “1 Gingel”, we display “1 Gingella” before “1 Gangele”.  Unfortunately, before version 5 of Elasticsearch, fuzzy and non-fuzzy matches were treated equally, and so you were just as likely to have the fuzzy matches listed before the non-fuzzy matches.

To work around this problem, we perform both a fuzzy and a non-fuzzy query and stitch together the results in code, listing the non-fuzzy matches before the fuzzy ones.
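A minimal Python sketch of this work-around follows. It assumes that both queries are sent as two named suggestions in a single request and that duplicates are removed when merging; the names build_suggest_body and merge_suggestions, the suggestion names “exact” and “fuzzy”, and the fuzziness value of 1 are all illustrative choices rather than a description of our actual service.

```python
def build_suggest_body(text, field="suggest"):
    """Request body holding one non-fuzzy and one fuzzy completion query."""
    return {
        "exact": {"text": text, "completion": {"field": field}},
        "fuzzy": {
            "text": text,
            # The "fuzzy" option switches on fuzzy matching for this query.
            "completion": {"field": field, "fuzzy": {"fuzziness": 1}},
        },
    }


def merge_suggestions(exact, fuzzy, limit=5):
    """List exact matches first, then fuzzy matches not already listed."""
    merged = []
    seen = set()
    for text in list(exact) + list(fuzzy):
        if text not in seen:
            seen.add(text)
            merged.append(text)
    return merged[:limit]


if __name__ == "__main__":
    print(build_suggest_body("1 Gingel"))
    print(merge_suggestions(["1 Gingella"], ["1 Gangele", "1 Gingella"]))
```

The fuzzy query normally returns the exact matches as well, which is why the merge step has to drop duplicates.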

The most recent release of Elasticsearch (Elasticsearch 5) now ranks non-fuzzy matches above fuzzy ones, and so this work-around may no longer be necessary.

Ignoring Separators

Completion suggesters can be configured to “ignore separators”. This means that the user can leave out all spaces and punctuation and they will still get appropriate suggestions.  This is especially useful for some multiword suburb and street names that are often mistyped.  For instance, “Row Ville” will match “Rowville”, and “Boxhill” will match “Box Hill”.
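In the Elasticsearch 2 mapping, we understand this behaviour to be controlled by the completion field's “preserve_separators” option, which defaults to true. The Python sketch below builds a mapping with that option switched off; the helper name completion_mapping is hypothetical, and the index and field names simply follow the earlier examples.

```python
import json


def completion_mapping(ignore_separators=True):
    """Index-creation body for an "address" type with a completion field."""
    return {
        "mappings": {
            "address": {
                "properties": {
                    "suggest": {
                        "type": "completion",
                        # False means separators are ignored when matching.
                        "preserve_separators": not ignore_separators,
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # The printed JSON can be used as the body of the PUT request
    # that creates the index.
    print(json.dumps(completion_mapping(), indent=2))
```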

Synonyms

We wanted street type and suffix abbreviations “St”, “Rd”, “N”, “S”, etc. to be treated as synonyms of their full forms “Street”, “Road”, “North”, “South”, etc. To achieve this, we use synonym mappings. Specifically, at indexing time, we map the full form of each street-type to both its abbreviation and to itself. For example, “street” is mapped to both “st” and “street”. By mapping the street types to themselves as well as to their abbreviation in the indexed documents, we were able to avoid applying synonym analysis at query time.  This turned out to be an important performance consideration possibly deserving a blog post of its own.

We needed to avoid the “multiword synonym problem” that can occur when single term abbreviations expand to multiple words – e.g. “shwy” is an abbreviation of “state highway”. These cause the completion suggester to fail due to their effect on the word positions. We devised a simple workaround that exploits the fact that we enable the “ignore separators” flag: we stripped out the spaces between the words of multi-word synonyms in our synonym definitions.

For example, by specifying “statehighway” instead of “state highway” in the synonyms file, we can still match a user’s input of “State Highway”, “Statehighway”, or “Shwy”.  I.e. the synonym file has the following entry:

statehighway => statehighway, shwy

An indexed document might look like this:

{
  "suggest": {
    "output": "1 Moonah State Highway, West Moonah, Tas 7009",
    "input": [
      "1 Moonah statehighway, West Moonah, Tas 7009",
      ...
    ]
  }
}

The user can start typing either “1 Moonah Sta” or “1 Moonah Shw” and in both cases will be offered “1 Moonah State Highway, West Moonah, Tas 7009”.
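The joined form has to get into the “input” text somehow. One possibility, sketched below in Python, is a small preprocessing step that collapses known multi-word street types before indexing. This is an assumption for illustration only: the helper name and the one-entry phrase list are hypothetical.

```python
import re

# Hypothetical list of multi-word street types; "state highway" is the
# only example given in the text.
MULTIWORD_STREET_TYPES = ["state highway"]


def collapse_multiword_types(address, phrases=MULTIWORD_STREET_TYPES):
    """Replace each multi-word street type with its space-free form."""
    for phrase in phrases:
        # Match the words of the phrase separated by any run of whitespace.
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b"
        joined = phrase.replace(" ", "")
        address = re.sub(pattern, joined, address, flags=re.IGNORECASE)
    return address


if __name__ == "__main__":
    print(collapse_multiword_types("1 Moonah State Highway, West Moonah, Tas 7009"))
```

Only the “input” values would be rewritten this way; the “output” keeps the original spelling so that the user sees “State Highway”.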

Sort order

We sort suggestions by street number then by unit number.  For example, a user query of “Jingella Ave” will return “1 Jingella Ave” before “2 Jingella Ave”, and will return “1/1 Jingella Ave” before “2/1 Jingella Ave”.

Elasticsearch completion suggester queries do not support explicit sorting of results at query time.  Instead, we took advantage of the fact that suggestions can be indexed with integer “weights”.  The larger the weight, the higher the result is ranked.

We computed weights based on a combination of street and unit numbers in a manner that ensures that the results are ranked first by street number, then by unit number.

We ran into a subtle problem caused by the fact that Elasticsearch converts the suggestion weights into single precision floating point numbers. Our initial implementation attempted to handle every possible combination of street and unit number from 1 to 10,000. This resulted in weights up to 100,000,000 (i.e. 10,000^2). The single precision floating point representation of integers above 2^24 (about 16.8 million) is only approximate: i.e. different numbers often result in the same floating point representation. This might not have been all that important if it had only affected addresses with large street numbers, but it was addresses with small street numbers that had to be ranked highest and hence had to have the largest weights. As you can imagine, it took us a while to figure this out as Elasticsearch gives you no warning that this is happening.
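The loss of precision is easy to reproduce outside Elasticsearch. The Python sketch below round-trips integers through IEEE 754 single precision using the standard struct module; it only demonstrates the arithmetic and makes no claim about how Elasticsearch stores weights internally. The helper name as_float32 is our own.

```python
import struct


def as_float32(value):
    """Round-trip a number through IEEE 754 single precision."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


if __name__ == "__main__":
    # Neighbouring integers above 2**24 can collapse to the same float.
    for weight in (2**24 - 1, 2**24, 2**24 + 1, 99_999_999, 100_000_000):
        print(weight, "->", as_float32(weight))
```

Two addresses whose weights collapse to the same value in this way can no longer be reliably ordered relative to each other.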

In the end, we configured our formula to handle street and unit numbers up to 1,000.  Larger numbers contribute zero to the weight.  This was more than enough for the vast majority of cases.
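We have not reproduced our exact formula here, but a weight with these properties could look like the Python sketch below. Everything in it is an illustrative assumption: the names, the decision to rank an address without a unit ahead of its numbered units, and the particular way the two components are combined.

```python
MAX_NUMBER = 1000       # largest street or unit number that affects ranking
BASE = MAX_NUMBER + 2   # count of distinct values one component can take


def _component(number):
    """Map a street or unit number to a value where smaller numbers score higher."""
    if number is None:
        return MAX_NUMBER + 1          # assumption: no unit ranks first
    if 1 <= number <= MAX_NUMBER:
        return MAX_NUMBER + 1 - number
    return 0                           # out-of-range numbers contribute zero


def suggestion_weight(street_number, unit_number=None):
    """Weight that ranks by street number first, then by unit number."""
    return _component(street_number) * BASE + _component(unit_number)


if __name__ == "__main__":
    for street, unit in [(1, None), (1, 1), (1, 2), (2, None), (5000, 3)]:
        print(street, unit, suggestion_weight(street, unit))
```

The largest weight this sketch can produce is a little over one million, comfortably below 2^24, so every weight survives conversion to single precision exactly.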

Alternative Autosuggest Implementations Considered

We investigated alternatives to completion suggesters, including those described in: “Implementing Autosuggest in Elasticsearch”. The conclusion of our investigations was that completion suggesters provided the best fit to our desired functionality, responsiveness, and simplicity.

For example, one approach we considered involved performing n-gram and edge n-gram analysis of the entire address as an un-tokenized string. This is how the autosuggest on our main listings search site currently handles simple suburb name completion. This works well for suburb names, but was not suitable for full address autosuggest for various reasons, including the difficulty of synonym matching and the prohibitively large inverted indexes that resulted.

Another approach considered was one that involved a combination of the standard tokeniser and various basic token filters. This required special pre-processing of the query text to identify the last term in the query as it was the term most likely to require completion, and so it needed to be matched against edge n-grams of the address terms.  This looked promising, but the completion suggesters looked simpler and able to provide the desired functionality.

Cluster Sizing

We chose to provision our Elasticsearch cluster with a sufficient number of sufficiently fast servers to allow us to create and feed new indices on a live cluster already handling production traffic on existing indices.

A big issue that we ran into was that the completion suggesters make heavy use of the JVM heap and can be prone to long garbage collection (GC) pauses, especially while feeding a new index.  Indeed, at one point we were suffering GC pauses lasting more than four seconds, which is not acceptable for an autosuggest system. Reducing the memory given to the JVM heap fixed the GC pauses.  We experimented using both Amazon EC2 C4 2xLarge (15 GB) and Amazon EC2 M4 2xLarge (32 GB).  Curiously, despite the fact that the C4 servers had half the memory of the M4 servers, in both cases we had to restrict the JVM heap to less than a third of the available memory.  In other words, the optimum JVM heap size was not an absolute value, but seemed to depend on the type of server and the total available memory.

In the end, our performance testing led us to host our Elasticsearch cluster on five EC2 C4 2xLarge servers. This ensured that feeding a new index on the cluster had little noticeable effect on the response times for queries on an existing index. Unfortunately, this means that most of the time our servers are very lightly loaded, and so we possibly need to reconsider our decision to feed new indices on actively used clusters.

Concluding Remarks

Using completion suggesters proved more challenging than expected, but we continue to believe that they were a good choice. At the time of writing, it has been about six months since we first released our Elasticsearch-based address autosuggest service. In that time, the service has been running very smoothly. Admittedly, it rarely has to handle more than a few hundred requests per minute, far fewer than the thousands of requests per minute that our suburb autosuggest service handles for our main listings search site. However, our performance testing indicates that we should be able to easily handle thousands of address autosuggest requests per minute with our existing Elasticsearch cluster.

The work we have done will allow us to easily make various enhancements that we believe will improve the user experience.