c_froehlich

Reputation: 1575

Elasticsearch: group into buckets, reduce to one document per bucket, group these documents

I'm looking for a way to compute the bounce rate of web pages with Elasticsearch.

We collect data in the following simplified structure:

{"id":"1", "timestamp":"2017-01-25:15:23", "sessionid":"s1", "page":"index"}
{"id":"2", "timestamp":"2017-01-25:15:24", "sessionid":"s1", "page":"checkout"}
{"id":"3", "timestamp":"2017-01-25:15:25", "sessionid":"s1", "page":"confirm"}

{"id":"4", "timestamp":"2017-01-25:15:26", "sessionid":"s2", "page":"index"}
{"id":"5", "timestamp":"2017-01-25:15:27", "sessionid":"s2", "page":"checkout"}

{"id":"6", "timestamp":"2017-01-25:15:26", "sessionid":"s3", "page":"product_a"}
{"id":"7", "timestamp":"2017-01-25:15:28", "sessionid":"s3", "page":"checkout"}

For this sample the result of the analysis should be:

2/3 of the users get lost at the checkout page.

1/3 of the users get lost at the confirm page.

More formally, I'm looking for a generic way to implement the following algorithm in an Elasticsearch query:

  1. group documents by a field
  2. sort each group (bucket) by a second field and reduce to the topmost document
  3. group all these remaining documents by a third field
  4. sort groups by number of documents
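
To make the intended result concrete, the four steps can be sketched outside Elasticsearch in plain Python. This is only an illustration of the algorithm on the sample documents, not an Elasticsearch query; the function name `last_page_counts` is made up for this sketch, and it assumes the timestamps share one fixed-width format so that comparing them as strings orders them chronologically.

```python
from collections import Counter

# Sample documents from the question.
events = [
    {"id": "1", "timestamp": "2017-01-25:15:23", "sessionid": "s1", "page": "index"},
    {"id": "2", "timestamp": "2017-01-25:15:24", "sessionid": "s1", "page": "checkout"},
    {"id": "3", "timestamp": "2017-01-25:15:25", "sessionid": "s1", "page": "confirm"},
    {"id": "4", "timestamp": "2017-01-25:15:26", "sessionid": "s2", "page": "index"},
    {"id": "5", "timestamp": "2017-01-25:15:27", "sessionid": "s2", "page": "checkout"},
    {"id": "6", "timestamp": "2017-01-25:15:26", "sessionid": "s3", "page": "product_a"},
    {"id": "7", "timestamp": "2017-01-25:15:28", "sessionid": "s3", "page": "checkout"},
]


def last_page_counts(docs):
    """Hypothetical helper: count sessions by the last page they visited."""
    # Steps 1 and 2: group by sessionid and keep only the latest document.
    # Assumption: fixed-width timestamps, so string comparison is chronological.
    last_by_session = {}
    for doc in docs:
        current = last_by_session.get(doc["sessionid"])
        if current is None or doc["timestamp"] > current["timestamp"]:
            last_by_session[doc["sessionid"]] = doc

    # Step 3: group the remaining documents by page.
    counts = Counter(doc["page"] for doc in last_by_session.values())

    # Step 4: sort the groups by number of documents, largest first.
    return counts.most_common()


result = last_page_counts(events)
print(result)  # [('checkout', 2), ('confirm', 1)]
```

On the sample data this yields two sessions ending on checkout and one on confirm, which matches the 2/3 and 1/3 figures above.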

My first attempt was to solve this with a terms aggregation followed by a top_hits aggregation and finally use a terms_pipeline aggregation to group the pages.

(simplified aggregation structure)

aggs
    terms
        field: sessionid
        aggs
            top_hits
                sort:timestamp desc
                size: 1
    terms_pipeline
        bucket_path: terms>top_hits
        field: page

... but unfortunately there is no such thing as a terms_pipeline aggregation. My bad.

Any ideas for an alternative approach?

Upvotes: 0

Views: 817

Answers (1)

Val

Reputation: 217254

Maybe I misunderstood something, but if you want to know where your users are bouncing, and since all pages are in a sequence, you could simply use a terms aggregation on the page field (to know which pages were visited) and a cardinality aggregation on the sessionid field (to know how many unique sessions you have). In this case, cardinality(sessionid) would yield 3.

Then again, since all pages are in a sequence, I think you don't really need to know what happened within a given session.

In your example, from the terms(page) aggregation, you'd know that 3 users landed on the checkout page but only one went to the confirm one. Using the cardinality of the sessions, this implicitly means that 2 users (3 total sessions - 1 confirm page hit) bounced on the checkout page.
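
A minimal sketch of that idea, written as Python dictionaries. The request body uses the terms and cardinality aggregations named above; the aggregation names `pages` and `sessions`, the helper `sessions_lost_before`, and the assumption that `page` and `sessionid` are mapped as keyword fields are all illustrative. The response is hand-written from the sample documents in the question, not real cluster output.

```python
# Request body: one terms aggregation on page, one cardinality on sessionid.
# Assumption: both fields are mapped as keyword fields.
request_body = {
    "size": 0,  # only the aggregations are needed, not the hits
    "aggs": {
        "pages": {"terms": {"field": "page"}},
        "sessions": {"cardinality": {"field": "sessionid"}},
    },
}

# Hand-written response for the seven sample documents (not actual output).
sample_response = {
    "aggregations": {
        "pages": {
            "buckets": [
                {"key": "checkout", "doc_count": 3},
                {"key": "index", "doc_count": 2},
                {"key": "confirm", "doc_count": 1},
                {"key": "product_a", "doc_count": 1},
            ]
        },
        "sessions": {"value": 3},
    }
}


def sessions_lost_before(response, final_page):
    """Hypothetical helper: total sessions minus hits on the final page."""
    aggs = response["aggregations"]
    hits = {b["key"]: b["doc_count"] for b in aggs["pages"]["buckets"]}
    return aggs["sessions"]["value"] - hits.get(final_page, 0)


lost_at_checkout = sessions_lost_before(sample_response, "confirm")
print(lost_at_checkout)  # 2
```

Two caveats apply to this sketch: doc_count counts documents rather than sessions, so a session that hits the same page twice would be counted twice, and the cardinality aggregation returns an approximate count on large data sets.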

Upvotes: 0
