Willian Fuks

Reputation: 11777

Total Sessions in BigQuery vs Google Analytics Reports

I'm just learning BigQuery so this might be a dumb question, but we want to get some statistics there and one of those is the total sessions in a given day.

To do so, I've queried in BQ:

select sum(sessions) as total_sessions from (
  select
    fullvisitorid,
    count(distinct visitid) as sessions,
    from (table_query([40663402], 'timestamp(right(table_id,8)) between timestamp("20150519") and timestamp("20150519")'))
    group each by fullvisitorid
)

(I'm using the table_query because later on we might increase the range of days)

This results in 1,075,137.

But in our Google Analytics reports, in the "Audience Overview" section, the same day shows:

This report is based on 1,026,641 sessions (100% of sessions).

There's always a difference of roughly 5%, regardless of the day. So I'm wondering: even though the query is quite simple, is there any mistake we've made?

Is this difference expected to happen? I read through BigQuery's documentation but couldn't find anything on this issue.

Thanks in advance,

Upvotes: 11

Views: 17867

Answers (4)

Androula Alekou

Reputation: 1

What worked for me was this:

SELECT count(distinct sessionId) FROM(
   SELECT CONCAT(clientId, "-", visitNumber, "-", date) as sessionId FROM `project-id.dataset-id.ga_sessions_*`
   WHERE _table_suffix BETWEEN "20191001" AND "20191031" AND totals.visits = 1)

The explanation (very well written up in this article: https://adswerve.com/blog/google-analytics-bigquery-tips-users-sessions-part-one/) is that we should be careful when counting sessions because, by default, Google Analytics breaks sessions that carry over midnight (in the time zone of the view). The same session can therefore end up in two daily tables:

[Image from the article mentioned above]

The code provided creates a sessionId by combining client id + visit number + date, which acknowledges the session break; the result is in a human-readable format. Finally, to match sessions in the Google Analytics UI, make sure to keep only rows with totals.visits = 1.
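To make the effect of the date component concrete, here is a minimal Python sketch on invented rows (not real export data). The field names mirror the query above; the values, the `session_id` helper, and the assumption that a midnight-split session keeps the same visitNumber in both daily tables are illustrative only.

```python
# Hypothetical rows as they might appear across two daily tables; all values are made up.
rows = [
    {"clientId": "111.222", "visitNumber": 3, "date": "20191014", "visits": 1},    # part before midnight
    {"clientId": "111.222", "visitNumber": 3, "date": "20191015", "visits": 1},    # same session, after midnight
    {"clientId": "333.444", "visitNumber": 1, "date": "20191015", "visits": None}, # no interaction event
]

def session_id(row, include_date=True):
    """Build a readable session key; the date part keeps midnight splits apart."""
    parts = [row["clientId"], str(row["visitNumber"])]
    if include_date:
        parts.append(row["date"])
    return "-".join(parts)

# Keep only rows with totals.visits = 1, as in the query above.
interactive = [r for r in rows if r["visits"] == 1]

with_date = {session_id(r) for r in interactive}
without_date = {session_id(r, include_date=False) for r in interactive}

print(len(with_date))     # 2: the split session is counted once per day
print(len(without_date))  # 1: the two halves collapse into one session
```

With the date in the key, the split session counts once per day, which is the behaviour the query above relies on.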

Upvotes: 0

Martin Weitzmann

Reputation: 4736

Simply use SUM(totals.visits) or, when using COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING) )), make sure totals.visits=1!

If you use visitId and you are not grouping per day, you will combine midnight-split-sessions!

Here are all scenarios:

SELECT
  COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitStartTime AS STRING) )) allSessionsUniquePerDay,
  COUNT(DISTINCT CONCAT(fullVisitorId, CAST(visitId AS STRING) )) allSessionsUniquePerSelectedTimeframe,
  sum(totals.visits) interactiveSessionsUniquePerDay, -- equals GA UI sessions
  COUNT(DISTINCT IF(totals.visits=1, CONCAT(fullVisitorId, CAST(visitId AS STRING)), NULL) ) interactiveSessionsUniquePerSelectedTimeframe,
  SUM(IF(totals.visits=1,0,1)) nonInteractiveSessions
FROM
  `project.dataset.ga_sessions_2017102*`

Wrap up:

  • fullVisitorId + visitId: useful to reconnect midnight-splits
  • fullVisitorId + visitStartTime: useful to take splits into account
  • totals.visits=1 for interaction sessions
  • fullVisitorId + visitStartTime where totals.visits=1: GA UI sessions (in case you need a session id)
  • SUM(totals.visits): simple GA UI sessions
  • fullVisitorId + visitId where totals.visits=1 and GROUP BY date: GA UI sessions with too many chances for errors and misunderstandings
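As a rough illustration of how these counts diverge, the sketch below runs the same five aggregations on a tiny invented table using SQLite from Python. This is not BigQuery: the table is a flattened stand-in (a `visits` column instead of totals.visits), `||` replaces CONCAT, CASE replaces IF, and all rows are made up. It assumes a midnight-split session keeps its visitId but gets a new visitStartTime, and that totals.visits is NULL for sessions without interaction.

```python
import sqlite3

# Invented rows: (fullVisitorId, visitId, visitStartTime, date, visits)
rows = [
    ("A", 100, 100, "20171020", 1),     # ordinary interactive session
    ("B", 200, 200, "20171020", 1),     # session crossing midnight, first part
    ("B", 200, 260, "20171021", 1),     # same visitId, second part after the split
    ("C", 300, 300, "20171021", None),  # no interaction: visits is NULL
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE ga_sessions ("
    "fullVisitorId TEXT, visitId INTEGER, visitStartTime INTEGER, date TEXT, visits INTEGER)"
)
con.executemany("INSERT INTO ga_sessions VALUES (?, ?, ?, ?, ?)", rows)

query = """
SELECT
  COUNT(DISTINCT fullVisitorId || '-' || visitStartTime) AS allSessionsUniquePerDay,
  COUNT(DISTINCT fullVisitorId || '-' || visitId) AS allSessionsUniquePerSelectedTimeframe,
  SUM(visits) AS interactiveSessionsUniquePerDay,
  COUNT(DISTINCT CASE WHEN visits = 1 THEN fullVisitorId || '-' || visitId END)
    AS interactiveSessionsUniquePerSelectedTimeframe,
  SUM(CASE WHEN visits = 1 THEN 0 ELSE 1 END) AS nonInteractiveSessions
FROM ga_sessions
"""
result = con.execute(query).fetchone()
print(result)  # (4, 3, 3, 2, 1)
```

On this toy data, SUM(visits) gives 3: the split session counts once per day and the non-interactive session is left out, which is the behaviour described for the GA UI above.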

Upvotes: 18

Willian Fuks

Reputation: 11777

After posting the question, we got in contact with Google support and found that Google Analytics only counts sessions in which an "event" was fired.

In BigQuery you will find all sessions, regardless of whether they had an interaction or not.

In order to find the same result as in GA, you should filter by sessions with totals.visits = 1 in your BQ query (totals.visits is 1 only for sessions that had an event being fired).

That is:

select sum(sessions) as total_sessions from (
  select
    fullvisitorid,
    count(distinct visitid) as sessions,
    from (table_query([40663402], 'timestamp(right(table_id,8)) between timestamp("20150519") and timestamp("20150519")'))
    where totals.visits = 1
    group each by fullvisitorid
)

Upvotes: 10

oortCloud

Reputation: 496

The problem could be due to "COUNT DISTINCT".

According to this post:

COUNT DISTINCT is a statistical approximation for all results greater than 1000

You could try setting an additional COUNT parameter to improve accuracy at the expense of performance (see post), but I would first try:

SELECT COUNT(CONCAT(fullvisitorid, '_', STRING(visitid))) as sessions
from (table_query([40663402], 'timestamp(right(table_id,8)) between timestamp("20150519") and timestamp("20150519")'))

Upvotes: 1
