Pentium10
Pentium10

Reputation: 207982

Filtering on multiple customDimensions and then aggregating

The data that comes out from the BigQuery implementation of GoogleAnalytics raw data looks like this:

|-visitId
|- date
|- (....)
+- hits
   |- time
   |- page
      |- pagePath
   |- eventInfo
      |- eventAction
   +- customDimensions
      |- index
      |- value

I am looking to grab 3 values out from the repeated customDimensions like

+---------+---------+-------+-----------+---------------+
| user_id | country | split | page Hits | CTA event hit |
+---------+---------+-------+-----------+---------------+
| 100     | US      | A     | 25000     | 500           |
+---------+---------+-------+-----------+---------------+
| 100     | US      | B     | 8000      | 90            |
+---------+---------+-------+-----------+---------------+
| 200     | ES      | A     | 400       | 2             |
+---------+---------+-------+-----------+---------------+

the first three columns are defined by hits.customDimensions.index 1,4,7.

The page hit SUM is how many views they did, the CTA event hit is sum of the event that is fired if they click a button on a page itself. For the sake of SQL simplicity we can name hits.page.pagePath='tshirt' and hits.eventInfo.eventAction='upsell'

I am having difficulty reading the 3 customDimensions out from the same repated field, then I am having difficulty locating the events that happened in the same session.

Update for those that are not familiar with BQ dataset

In the below image each line is a hit, and multiple hits can be on the same row. In BigQuery that is called REPEATED field. In the image you see 3 taller rows. The first row has 8 hit. The image doesn't contain multiple customDimensions, but that can be multiple for the same hit. To access a sample DB set on BigQuery read here it's free.

enter image description here

Upvotes: 2

Views: 469

Answers (2)

Willian Fuks
Willian Fuks

Reputation: 11787

Before answering, I'd like to show the mock data that I used as a guide to come up with a solution, hopefully it will be useful:

WITH mock_data AS(
select '0' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(0 as hitnumber, [STRUCT(0 as index, '' as value)] as customdimensions, STRUCT('/home' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all
select '0' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(1 as hitnumber, [STRUCT(0 as index, '' as value)] as customdimensions, STRUCT('/randompage' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all
select '0' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(2 as hitnumber, [STRUCT(1 as index, '000' as value), STRUCT(4 as index, 'US' as value), STRUCT(7 as index, 'A' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all
select '0' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(3 as hitnumber, [STRUCT(1 as index, '000' as value), STRUCT(4 as index, 'US' as value), STRUCT(7 as index, 'A' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all

select '1' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(0 as hitnumber, [STRUCT(0 as index, '' as value)] as customdimensions, STRUCT('/home' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all
select '1' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(1 as hitnumber, [STRUCT(0 as index, '' as value)] as customdimensions, STRUCT('/randompage' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all
select '1' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(2 as hitnumber, [STRUCT(1 as index, '100' as value), STRUCT(4 as index, 'US' as value), STRUCT(7 as index, 'A' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all
select '1' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(3 as hitnumber, [STRUCT(0 as index, '' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('specific_category' as eventcategory, 'specific_label' as eventlabel, 'upsell' as eventaction) as eventinfo)] hits union all
select '1' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(4 as hitnumber, [STRUCT(1 as index, '100' as value), STRUCT(4 as index, 'US' as value), STRUCT(7 as index, 'A' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits  union all

select '2' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(0 as hitnumber, [STRUCT(0 as index, '' as value)] as customdimensions, STRUCT('/home' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all
select '2' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(1 as hitnumber, [STRUCT(0 as index, '' as value)] as customdimensions, STRUCT('/randompage' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all
select '2' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(2 as hitnumber, [STRUCT(0 as index, '' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('specific_category' as eventcategory, 'specific_label' as eventlabel, 'upsell' as eventaction) as eventinfo)] hits union all

select '3' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(0 as hitnumber, [STRUCT(0 as index, '' as value)] as customdimensions, STRUCT('/home' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all
select '3' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(1 as hitnumber, [STRUCT(0 as index, '' as value)] as customdimensions, STRUCT('/randompage' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all
select '3' fullvisitorid, 1 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(2 as hitnumber, [STRUCT(1 as index, '300' as value), STRUCT(4 as index, 'US' as value), STRUCT(7 as index, 'A' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all
select '3' fullvisitorid, 1 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(3 as hitnumber, [STRUCT(1 as index, '300' as value), STRUCT(4 as index, 'US' as value), STRUCT(7 as index, 'A' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all

select '4' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(0 as hitnumber, [STRUCT(0 as index, '' as value)] as customdimensions, STRUCT('/home' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all
select '4' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(1 as hitnumber, [STRUCT(0 as index, '' as value)] as customdimensions, STRUCT('/randompage' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all
select '4' fullvisitorid, 0 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(2 as hitnumber, [STRUCT(1 as index, '400' as value), STRUCT(4 as index, 'BR' as value), STRUCT(7 as index, 'B' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all
select '4' fullvisitorid, 1 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(0 as hitnumber, [STRUCT(0 as index, '' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('specific_category' as eventcategory, 'specific_label' as eventlabel, 'upsell' as eventaction) as eventinfo)] hits union all
select '4' fullvisitorid, 1 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(1 as hitnumber, [STRUCT(1 as index, '400' as value), STRUCT(4 as index, 'BR' as value), STRUCT(7 as index, 'B' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all

select '4' fullvisitorid, 2 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(0 as hitnumber, [STRUCT(1 as index, '400' as value), STRUCT(4 as index, 'BR' as value), STRUCT(7 as index, 'B' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all
select '4' fullvisitorid, 2 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(1 as hitnumber, [STRUCT(0 as index, '' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('specific_category' as eventcategory, 'specific_label' as eventlabel, 'upsell' as eventaction) as eventinfo)] hits union all
select '4' fullvisitorid, 2 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(2 as hitnumber, [STRUCT(0 as index, '' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('specific_category' as eventcategory, 'specific_label' as eventlabel, 'upsell' as eventaction) as eventinfo)] hits union all
select '4' fullvisitorid, 3 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(0 as hitnumber, [STRUCT(1 as index, '400' as value), STRUCT(4 as index, 'BR' as value), STRUCT(7 as index, 'A' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits union all
select '4' fullvisitorid, 3 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(1 as hitnumber, [STRUCT(0 as index, '' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('specific_category' as eventcategory, 'specific_label' as eventlabel, 'upsell' as eventaction) as eventinfo)] hits union all
select '4' fullvisitorid, 3 visitid, ARRAY<STRUCT< hitnumber INT64, customdimensions ARRAY<STRUCT<index INT64, value STRING>>, page STRUCT<pagepath STRING>, eventinfo STRUCT<eventcategory STRING, eventlabel STRING, eventaction STRING> >> [STRUCT(2 as hitnumber, [STRUCT(1 as index, '400' as value), STRUCT(4 as index, 'BR' as value), STRUCT(7 as index, 'A' as value)] as customdimensions, STRUCT('tshirt' as pagepath) as page, STRUCT('' as eventcategory, '' as eventlabel, '' as eventaction) as eventinfo)] hits
)

I've simulated 4 distinct users visiting the website using the same schema we find in BigQuery ga_sessions table.

Some of my assumptions might be a bit different from your actual data. If this is the case, please let me know and we can adapt the mock data as a guide for more accurate answers (I actually use these mocks to run integration tests in our production environment so they can be helpful).

The assumptions I did was (correct me if I'm wrong):

  1. The customDimensions are only fired when hits.page.pagepath=tshirt
  2. They are always fired. That is, every visit to the "tshirt" page corresponds to a triggering of the customs.
  3. When the eventAction click happens, customDimensions are not fired at the same time (i.e, the events are fired in one hitNumber and the customized events into another).

This might give the result expected:

select
    user_id,
    country,
    _split,
    sum(page_hits) page_hits,
    sum(CTA_event_hit) CTA_event_hit
from(
select
    max(user_id) user_id,
    max(country) country,
    max(_split) _split,
    max(page_hits) page_hits,
    max(CTA_event_hit) CTA_event_hit
from(
select
    fv,
    v,
    user_id,
    country,
    _split,
    count(case when user_id is not null then 1 end) page_hits,
    sum(click_flag) CTA_event_hit
from(
select 
    fullvisitorid fv,
    visitid v,
    (select custd.value from unnest(hits.customdimensions) custd where custd.index = 1) user_id,
    (select custd.value from unnest(hits.customdimensions) custd where custd.index = 4) country,
    (select custd.value from unnest(hits.customdimensions) custd where custd.index = 7) _split,
    case when hits.eventinfo.eventcategory = 'specific_category' and hits.eventinfo.eventlabel = 'specific_label' and hits.eventinfo.eventaction = 'upsell' then 1 end click_flag
from mock_data,
unnest(hits) hits
where 1 = 1
    and hits.page.pagepath = 'tshirt'
)
group by fv, v, user_id, country, _split
)
group by fv, v
having user_id is not null
)
group by user_id, country, _split

That results in:

enter image description here

Basically, the are some subselect queries to retrieve user_id, country and split. For every session (visitid) the data is aggregated by using the MAX operator and finally there's a final aggregation on the user_id, country and split level.

In other to query on your dataset you just need to change mock_data to the correspondent ga_session desired table.

Not sure if this will solve your problem but might be of some help.

As a final note, it seems like this data is the setup for maybe an AB test or some analyzes of performances for different variations of your site. This being the case, I'd recommend not allowing users to change their split value as this can lead to some poisoning of the data (which might skew results).

Upvotes: 2

Mosha Pasumansky
Mosha Pasumansky

Reputation: 14014

To make sure I understood the problem, I only provide here solution for computing custom columns and page hits metric, but not (yet) for CTA event hit. Using sample GA table and Standard SQL, it can look something like this:

SELECT 
  ARRAY(SELECT AS STRUCT c.product, c.color, 1 page_hits 
  FROM t.hits hit CROSS JOIN
      UNNEST(ARRAY(
        SELECT DISTINCT AS STRUCT
          if(dim.index = 1, dim.value, NULL) product,
          if(dim.index = 2, dim.value, NULL) color
        FROM hit.customDimensions dim 
        WHERE dim.index in (1,2))) c
  )
FROM `google.com:analytics-bigquery.LondonCycleHelmet.ga_sessions_20130910` t

Basically, in the inner SELECT we convert customDimensions.index into separate columns (product and color in this example), and then the outer SELECT prepares to count them by setting page_hits to 1 for every hit.

Upvotes: 1

Related Questions