RICHARD HALLETT

2019-01-06 15:08

Elasticsearch random sampling

Problem

Recently I needed to randomly sample query results from an Elasticsearch index. The results had to come back not only in random order but also grouped by a field.

Solution

Random samples

Returning results in random order is the easy part: use random scoring within a function score query:

{
  "query": {
    "function_score": {
        "random_score": {}
    }
  }
}

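Note: the random score function takes an optional seed, which you will want if the ordering needs to be reproducible. Without one it uses the current timestamp, though this depends on your ES version, so check the documentation for the version you run.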

If you want to combine this with another query, put that query inside the function score itself rather than alongside it, e.g.

{
  "query": {
    "function_score": {
        "query": {
            "bool" : {
                "must" : {
                    "term" : { "user" : "rph" }
                }
            }
        },
        "random_score": {}
    }
  }
}
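
The wrapping step can be captured in a small helper. This is a minimal Python sketch that only builds the request body as a dict and does not talk to a cluster. The function name with_random_score and its seed and field arguments are illustrative assumptions, and which options random_score accepts depends on your ES version.

```python
import json


def with_random_score(query=None, seed=None, field="_seq_no"):
    """Wrap an optional query in a function_score with random scoring.

    Hypothetical helper: seed and field are only passed through to
    random_score when a seed is given. Check which options your ES
    version accepts before relying on them.
    """
    random_score = {}
    if seed is not None:
        random_score = {"seed": seed, "field": field}

    function_score = {"random_score": random_score}
    if query is not None:
        # The filtering query lives inside function_score, not beside it.
        function_score["query"] = query

    return {"query": {"function_score": function_score}}


body = with_random_score({"bool": {"must": {"term": {"user": "rph"}}}})
print(json.dumps(body, indent=2))
```

Called with no arguments, the helper produces the plain random query from the first example.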

Grouped random samples

For grouping, the recommended approach is to use the top_hits aggregation, as mentioned in ES's own documentation.

{
   "query":{
      "function_score":{
         "random_score":{}
      }
   },
   "aggs":{
      "random_sample_groups":{
         "terms":{
            "field":"my_field_to_group_by",
            "size":10
         },
         "aggs":{
            "random_samples":{
               "top_hits":{
                  "size":1
               }
            }
         }
      }
   }
}

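Here a terms aggregation does the grouping, and a top_hits sub-aggregation returns the actual samples. Because top_hits sorts by the main query's score by default, the random score decides which documents surface in each group.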

  • The size within the terms aggregation sets the maximum number of groups returned. If you don't know how many groups to expect, look into composite aggregations, which let you paginate through them.
  • The top_hits size is the number of hits to retrieve per group; in this example we want just 1 from each.
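
The two size settings can be exposed as parameters when building the request in code. The following is a rough Python sketch that assembles the same body as the JSON above. The function name grouped_random_sample and its parameter names are made up for illustration, and the aggregation names simply mirror the example.

```python
import json


def grouped_random_sample(group_field, max_groups=10, per_group=1):
    """Build a request body for random samples grouped by a field.

    Hypothetical helper: max_groups maps to the terms size and
    per_group maps to the top_hits size.
    """
    return {
        "query": {"function_score": {"random_score": {}}},
        "aggs": {
            "random_sample_groups": {
                # Upper bound on the number of groups (buckets) returned.
                "terms": {"field": group_field, "size": max_groups},
                "aggs": {
                    # Number of randomly scored hits kept per group.
                    "random_samples": {"top_hits": {"size": per_group}}
                },
            }
        },
    }


request_body = grouped_random_sample("my_field_to_group_by", max_groups=10, per_group=1)
print(json.dumps(request_body, indent=3))
```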

It is important to realise that your sampled results live in the aggregation, not in the main result set, so your application needs to pull the _source of each hit out of every bucket. If you don't want any hits returned by the main query, set size to 0 at the top level of the request.
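
As a sketch of that parsing step, the Python below walks the buckets and collects the _source of each top hit, keyed by group. The function name extract_grouped_sources is hypothetical, and the response dict is a hand-written fragment with invented data that only imitates the shape of an ES reply for the request above; a real response carries more fields.

```python
def extract_grouped_sources(response,
                            groups_agg="random_sample_groups",
                            samples_agg="random_samples"):
    """Return {group key: [_source, ...]} from a grouped sampling response.

    Hypothetical helper: the default aggregation names match the
    example request and must be changed if yours differ.
    """
    grouped = {}
    for bucket in response["aggregations"][groups_agg]["buckets"]:
        # Each bucket holds its own hits under the top_hits aggregation name.
        hits = bucket[samples_agg]["hits"]["hits"]
        grouped[bucket["key"]] = [hit["_source"] for hit in hits]
    return grouped


# Invented sample data, trimmed to the fields the helper reads.
response = {
    "hits": {"hits": []},  # empty when the request uses size 0
    "aggregations": {
        "random_sample_groups": {
            "buckets": [
                {
                    "key": "group_a",
                    "doc_count": 3,
                    "random_samples": {
                        "hits": {"hits": [{"_id": "1", "_source": {"user": "rph"}}]}
                    },
                },
                {
                    "key": "group_b",
                    "doc_count": 1,
                    "random_samples": {
                        "hits": {"hits": [{"_id": "7", "_source": {"user": "abc"}}]}
                    },
                },
            ]
        }
    },
}

samples = extract_grouped_sources(response)
print(samples)
```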

Final thoughts

You can combine these pieces further to build more complex random queries, or return only subsets of each document as required, but these are the basics of getting random grouped samples from ES.