Warren Shore
Warren Shore

Reputation: 206

Elasticsearch Histogram of visits

I'm quite new to Elasticsearch and I fail to build a histogram based on ranges of visits. I am not even sure that it's possible to create this kind of chart by using a single query in Elasticsearch, but I'm the feeling that could be possible with pipeline aggregation or may be scripted aggregation.

Here is a test dataset with which I'm working:

PUT /test_histo
{ "settings": { "number_of_shards": 1 }}

PUT /test_histo/_mapping/visit
{
   "properties": {
      "user": {"type": "string" },
      "datevisit": {"type": "date"},
      "page": {"type": "string"}
   }
}

POST test_histo/visit/_bulk
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"John","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Jean","page":"productXX.hmtl","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Robert","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Mary","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"Mary","page":"media_center.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"John","page":"home.html","datevisit":"2015-11-25"}
{"index":{"_index":"test_histo","_type":"visit"}}
{"user":"John","page":"media_center.html","datevisit":"2015-11-26"}

If we consider the ranges [1,2[, [2,3[, [3, inf.[

The expected result should be :

All my efforts to find the histogram showing a customer visit frequency remained to date unsuccessful. I would be pleased to have a few tips, tricks or ideas to get a response to my problem.

Upvotes: 6

Views: 528

Answers (3)

Sumit
Sumit

Reputation: 2280

There are two ways you can do it.

First is doing it in ElasticSearch which will require Scripted Metric Aggregation. You can read more about it here.

Your query would look like this

{
  "size": 0,
  "aggs": {
    "visitors_over_time": {
      "date_histogram": {
        "field": "datevisit",
        "interval": "week"
      },
      "aggs": {
        "no_of_visits": {
          "scripted_metric": {
            "init_script": "_agg['values'] = new java.util.HashMap();",
            "map_script": "if (_agg.values[doc['user'].value]==null) {_agg.values[doc['user'].value]=1} else {_agg.values[doc['user'].value]+=1;}",
            "combine_script": "someHashMap = new java.util.HashMap();for(x in _agg.values.keySet()) {value=_agg.values[x];if(value<3){key='[' + value +',' + (value + 1) + '[';}else{key='[' + value +',inf[';}; if(someHashMap[key]==null){someHashMap[key] = 1}else{someHashMap[key] += 1}}; return someHashMap;"
          }
        }
      }
    }
  }
}

where you can change period of time in date_histogram object in the field interval by values like day, week, month.

Your response would look like this

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 7,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "visitors_over_time": {
      "buckets": [
        {
          "key_as_string": "2015-11-23T00:00:00.000Z",
          "key": 1448236800000,
          "doc_count": 7,
          "no_of_visits": {
            "value": [
              {
                "[2,3[": 1,
                "[3,inf[": 1,
                "[1,2[": 2
              }
            ]
          }
        }
      ]
    }
  }
} 

Second method is to the work of scripted_metric in client side. You can use the result of Terms Aggregation. You can read more about it here.

Your query will look like this GET test_histo/visit/_search

{
  "size": 0,
  "aggs": {
    "visitors_over_time": {
      "date_histogram": {
        "field": "datevisit",
        "interval": "week"
      },
      "aggs": {
        "no_of_visits": {
          "terms": {
            "field": "user",
            "size": 10
          }
        }
      }
    }
  }
}

and the response will be

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 7,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "visitors_over_time": {
      "buckets": [
        {
          "key_as_string": "2015-11-23T00:00:00.000Z",
          "key": 1448236800000,
          "doc_count": 7,
          "no_of_visits": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "john",
                "doc_count": 3
              },
              {
                "key": "mary",
                "doc_count": 2
              },
              {
                "key": "jean",
                "doc_count": 1
              },
              {
                "key": "robert",
                "doc_count": 1
              }
            ]
          }
        }
      ]
    }
  }
}

where on the response you can do count for each doc_count for each period.

Upvotes: 3

xecgr
xecgr

Reputation: 5193

Have a look to this solution:

{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "periods": {
            "filters": {
                "filters": {
                    "1-2": {
                        "range": {
                            "datevisit": {
                                "gte": "2015-11-25", 
                                "lt": "2015-11-26"
                            }
                        }
                    }, 
                    "2-3": {
                        "range": {
                            "datevisit": {
                                "gte": "2015-11-26", 
                                "lt": "2015-11-27"
                            }
                        }
                    }, 
                    "3-": {
                        "range": {
                            "datevisit": {
                                "gte": "2015-11-27", 
                            }
                        }
                    }
                }
            },
            "aggs": {
                "users": {
                    "terms": {"field": "user"}
                }
            }
        }
    }
}

Step by step:
Filter aggregation: You can define ranged values for the next aggregation, in this case we define 3 periods based on date range filter
Nested Users aggregation: This aggregation returns as many results as filters you'd defined. So, in this case, you'll get 3 values using range date filtering

You'll get a result like this:

{   
    ...
    "aggregations" : {
        "periods" : {
            "buckets" : {
                "1-2" : {
                    "users" : {
                        "buckets" : [
                            {"key" : XXX,"doc_count" : NNN},
                            {"key" : YYY,"doc_count" : NNN},
                        ]
                    }
                },
                "2-3" : {
                    "users" : {
                        "buckets" : [
                            {"key" : XXX1,"doc_count" : NNN1},
                            {"key" : YYY1,"doc_count" : NNN1},
                        ]
                    }
                },
                "3-" : {
                    "users" : {
                        "buckets" : [
                            {"key" : XXX2,"doc_count" : NNN2},
                            {"key" : YYY2,"doc_count" : NNN2},
                        ]
                    }
                },
            }
        }
    }
}

Try it, and tell if it works

Upvotes: 0

Richard L
Richard L

Reputation: 1221

Have a look at:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html

If you whant to show it in fancy already fixed UI use Kibana.

A query like this:

GET _search
{
   "query": {
      "match_all": {}
   }, 
   {
    "aggs" : {
        "visits" : {
            "date_histogram" : {
                "field" : "datevisit",
                "interval" : "month"
            }
        }
    }
}
}

Should give you a histogram, I don't have elastic here at the moment so I might have some fat finggered typos.

Then you could ad query terms to only show histogram for specific page our you could have an aouter aggregation bucket wich aggregates / page or user.

Something like this:

GET _search
{
   "query": {
      "match_all": {}
   }, 
   {
       {
    "aggs" : {
        "users" : {
            "terms" : {
                "field" : "user",
            },
        "aggs" : {
            "visits" : {
                "date_histogram" : {
                    "field" : "datevisit",
                    "interval" : "month"
                }
            }
        }
  }
}

Upvotes: 0

Related Questions