Using Elasticsearch field collapsing to group related search results

Realestate “Developer Projects” are an important feature that REA offers to its customers (i.e. realestate agencies and property developers). An agency can use a project to group related listings so that they are displayed together in our search results. To implement this feature, we chose Elasticsearch’s support for “field collapsing”.

The remainder of this article includes a summary of the problem, the approaches considered, and a bit more detail on the chosen approach.

What are Projects?

A typical use of projects is one where a building developer, instead of creating a listing for every single apartment in the building, creates a listing for each of the different types of apartments and groups them under a single “project”.  When a user’s search matches any of the listings in the project (i.e. matches on the search price, location, bedroom count, etc.), then the first five of the project’s matching listings are grouped together and displayed along with an address, title, and image of the project itself.  Here is an example project returned by a search for two bedroom apartments in Brunswick:

[Image: project search result]

Elasticsearch modelling options

We are re-platforming our realestate listings search using Elasticsearch as our search engine. Elasticsearch offers only a limited set of options for expressing relationships between documents, such as the relationship between a project and its child (member) listings. Unlike in a relational database, it is not possible to simply put projects and listings in separate “tables” and join the two.

There were three main options considered:

  1. nested-objects: modelling projects as single documents containing a nested array of child listings.
  2. parent/child: modelling projects and child listings as separate documents and taking advantage of Elasticsearch’s support for parent/child relationships.
  3. field collapsing: only storing child listing documents, but duplicating project information on each child listing and “collapsing” the results on the project id. This is explained further below.

When considering the three options, it was not obvious what the relative impacts would be on either the code complexity or performance. So we spiked each option.  We found all three to be feasible approaches, but chose field collapsing because it had the:

  • best performance
  • least impact on the structure of the Elasticsearch queries

The remainder of this article provides an introduction to field collapsing and explains how we use it.

Field Collapsing

This feature was only introduced to Elasticsearch in mid 2017 and so was quite new when we kicked off this investigation. Similar functionality is also available in other search engines such as Solr and FAST. Field collapsing is a query-time directive that tells Elasticsearch to group the results by a specified field; when combined with the optional “inner_hits” sub-directive, it also returns the top matching documents within each group.

To be able to use field collapsing for grouping together project results, we need to insert a separate document for every child listing, and each of these must include the project id.  For example:

# Documents for project 6000
{"projectId": "6000", "price": 500000, "bedrooms": 2, "title": "Affordable luxury"}
{"projectId": "6000", "price": 700000, "bedrooms": 4, "title": "Spacious"}

# Documents for project 6001
{"projectId": "6001", "price": 550000, "bedrooms": 2, "title": "Stunning"}
{"projectId": "6001", "price": 650000, "bedrooms": 3, "title": "Excellent views"}
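One way to produce documents like these is to copy the project id onto each child listing at index time. The following is a minimal Python sketch of that step, not our actual indexing code; the `denormalise` helper and the shape of its input are assumptions made for illustration.

```python
from typing import Any, Dict, List


def denormalise(project_id: str, listings: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return one document per child listing, each carrying the project id."""
    # Hypothetical helper: the project id is duplicated onto every child
    # so that the search engine can later collapse on it.
    return [{"projectId": project_id, **listing} for listing in listings]


docs = denormalise(
    "6000",
    [
        {"price": 500000, "bedrooms": 2, "title": "Affordable luxury"},
        {"price": 700000, "bedrooms": 4, "title": "Spacious"},
    ],
)
for doc in docs:
    print(doc)
```

Any other project level fields that need to be searchable would be copied onto each child in the same way.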

To understand how field collapsing works, let’s start with an ordinary search request without field collapsing such as this one:

{
  "query": {"range": {"bedrooms": {"gte": 2}}},
  "sort": [{"price": "desc"}]
}

The results from this would have listings from different projects interleaved. Notice how the two listings for project 6000 are separated by the two listings for project 6001 (for the sake of brevity, most of the metadata fields have been removed):

{
  "_id": "1001",
  "_source": {"projectId": "6000", "price": 700000, "bedrooms": 4, "title": "Spacious"}
},
{
  "_id": "1003",
  "_source": {"projectId": "6001", "price": 650000, "bedrooms": 3, "title": "Excellent views"}
},
{
  "_id": "1002",
  "_source": {"projectId": "6001", "price": 550000, "bedrooms": 2, "title": "Stunning"}
},
{
  "_id": "1000",
  "_source": {"projectId": "6000", "price": 500000, "bedrooms": 2, "title": "Affordable luxury"}
}

To do field collapsing, all you need to do is add a “collapse” directive that specifies the field upon which you wish to “collapse”; in our case it will be the project id. The default behaviour is to return one document for each group of documents having the same collapse field value, i.e. only one listing per project. For our particular case, we want the first five documents for each project id, returned together under the first match of each group. Fortunately, the collapse directive takes an “inner_hits” directive that does exactly what we want.

So a search with field collapsing will look something like this:

{
  "query": {"range": {"bedrooms": {"gte": 2}}},
  "sort": [{"price": "desc"}],
  "collapse": {
    "field": "projectId",
    "inner_hits": {"size": 5, ... }
  }
}
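A request body like this can be assembled programmatically. The sketch below builds it as a plain Python dictionary, under stated assumptions: the `build_collapsed_search` function, the inner-hits name `project_listings`, and the inner-hits sort are illustrative choices, not necessarily what our production query uses.

```python
import json
from typing import Any, Dict


def build_collapsed_search(min_bedrooms: int, group_size: int = 5) -> Dict[str, Any]:
    """Build a search body that collapses results on projectId."""
    return {
        "query": {"range": {"bedrooms": {"gte": min_bedrooms}}},
        "sort": [{"price": "desc"}],
        "collapse": {
            "field": "projectId",
            "inner_hits": {
                # Assumed name; it labels this group of inner hits in the response.
                "name": "project_listings",
                "size": group_size,
                # Assumed: order the children the same way as the top-level hits.
                "sort": [{"price": "desc"}],
            },
        },
    }


body = build_collapsed_search(min_bedrooms=2)
print(json.dumps(body, indent=2))
```

Note that Elasticsearch expects the collapse field to be a single-valued keyword or numeric field, so a string project id would need to be mapped as a keyword.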

Each result will look something like this (for the sake of brevity, the response has been simplified a little):

{
  "_id": "1001", ← id of the first matching child listing in project 6000
  "_source": {"projectId": "6000", "price": 700000, ...}, ← the first matching child
  "inner_hits": { ← inner-hits contains the first 5 matching hits
    "total": 2,
    "hits": [
      {
        "_id": "1001", ← note how the first matching child is repeated here in the inner-hits
        "_source": {"projectId": "6000", "price": 700000, ...}
      },
      {
        "_id": "1000",
        "_source": {"projectId": "6000", "price": 500000, ...}
      }
    ]
  }
},
{
  "_id": "1003", ← id of the first matching child listing in project 6001
  "_source": {"projectId": "6001", "price": 650000, ...},
  "inner_hits": {
    "total": 2,
    "hits": [
      {
        "_id": "1003",
        "_source": {"projectId": "6001", "price": 650000, ...}
      },
      {
        "_id": "1002",
        "_source": {"projectId": "6001", "price": 550000, ...}
      }
    ]
  }
}
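To make the grouping concrete, the following Python sketch mimics the simplified response above on an in-memory list of hits. It only illustrates the observable behaviour and is not how Elasticsearch implements collapsing; the `collapse` function is a hypothetical helper, and it assumes the hits are already sorted and that inner hits keep the same order as the top-level hits.

```python
from typing import Any, Dict, List


def collapse(hits: List[Dict[str, Any]], field: str, inner_size: int = 5) -> List[Dict[str, Any]]:
    """Group already-sorted hits by `field`; the first hit of each group becomes the top-level result."""
    groups: Dict[Any, Dict[str, Any]] = {}
    for hit in hits:
        key = hit["_source"][field]
        if key not in groups:
            # First match for this key: it represents the whole group.
            groups[key] = {**hit, "inner_hits": {"total": 0, "hits": []}}
        inner = groups[key]["inner_hits"]
        inner["total"] += 1
        if len(inner["hits"]) < inner_size:
            inner["hits"].append(hit)
    # Dictionaries preserve insertion order, so groups stay in first-match order.
    return list(groups.values())


sorted_hits = [
    {"_id": "1001", "_source": {"projectId": "6000", "price": 700000}},
    {"_id": "1003", "_source": {"projectId": "6001", "price": 650000}},
    {"_id": "1002", "_source": {"projectId": "6001", "price": 550000}},
    {"_id": "1000", "_source": {"projectId": "6000", "price": 500000}},
]
grouped = collapse(sorted_hits, "projectId")
for group in grouped:
    print(group["_id"], [h["_id"] for h in group["inner_hits"]["hits"]])
# prints:
# 1001 ['1001', '1000']
# 1003 ['1003', '1002']
```

As in the response above, the first matching child of each project appears both as the top-level hit and as the first of its inner hits.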

Additional Issues

Some other issues that may be of interest:

  • We need to support listings that are “stand-alone” as well as those that belong to projects. Unfortunately this means that we need to provide our stand-alone listings with artificial project ids; otherwise all the stand-alone listings get grouped together as if they all belonged to a single project!
  • Our searches need to be able to match against some project level information as well as listing level information. For example, our keyword filtering can match against the project name and description in addition to the title and description on each child listing.  All of this project level information needs to be duplicated on each child listing. We make use of Elasticsearch’s support for “_source” directives at both the top-level and at the inner-hits level to limit the fields returned:
    • The top level hit responses only include the project level information
    • The inner-hits for each response only contain the child listing level information.
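Both points can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than our production code: the field names `projectName` and `projectDescription`, the `standalone-` id scheme, the inner-hits name, and both helper functions are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical field lists: projectName and projectDescription are illustrative names only.
PROJECT_FIELDS: List[str] = ["projectId", "projectName", "projectDescription"]
LISTING_FIELDS: List[str] = ["price", "bedrooms", "title"]


def with_collapse_key(listing_id: str, doc: Dict[str, Any]) -> Dict[str, Any]:
    """Give a stand-alone listing an artificial project id so it forms a group of one."""
    if doc.get("projectId"):
        return doc
    # Assumed id scheme; any value that cannot clash with a real project id would do.
    return {**doc, "projectId": f"standalone-{listing_id}"}


def build_filtered_search(query: Dict[str, Any]) -> Dict[str, Any]:
    """Collapse on projectId, limiting the returned fields at each level."""
    return {
        "query": query,
        "_source": PROJECT_FIELDS,  # top-level hits: project level fields only
        "collapse": {
            "field": "projectId",
            "inner_hits": {
                "name": "project_listings",  # assumed name
                "size": 5,
                "_source": LISTING_FIELDS,  # inner hits: listing level fields only
            },
        },
    }


standalone = with_collapse_key("2000", {"price": 450000, "bedrooms": 2, "title": "Cosy"})
body = build_filtered_search({"range": {"bedrooms": {"gte": 2}}})
print(standalone["projectId"])
```

Deriving the artificial id from the listing's own id keeps it unique per listing, so each stand-alone listing collapses into a group containing only itself.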

Conclusion

We have successfully implemented support for realestate developer projects through the use of Elasticsearch’s field collapsing feature.  This has turned out to be reasonably straightforward, had little impact on how we model listings in Elasticsearch, and had only a minor impact on our query time performance.
