Willian Fuks
Willian Fuks

Reputation: 11777

BigQuery GA Exported with Duplicated Rows

We have been trying to explain why this happened in all of our datasets but so far we had no success.

We observed that starting on 18 April our ga_sessions dataset had for the most part duplicated entries (like 99% of rows). As an example, I tested this query:

SELECT
  fullvisitorid fv,
  visitid v,
  ARRAY(
  SELECT
    AS STRUCT hits.*
  FROM
    UNNEST(hits) hits
  ORDER BY
    hits.hitnumber) h
FROM
  `dafiti-analytics.40663402.ga_sessions*`
WHERE
  1 = 1
  AND REGEXP_EXTRACT(_table_suffix, r'.*_(.*)') BETWEEN FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
ORDER BY
  fv,
  v
LIMIT
  100

And the result was:

enter image description here

We tried to investigate when this began to happen, so I ran this query:

SELECT
  date,
  f,
  COUNT(f) freq from(
  SELECT
    date,
    fullvisitorid fv,
    visitid v,
    COUNT(CONCAT(fullvisitorid, CAST(visitid AS string))) f
  FROM
    `dafiti-analytics.40663402.ga_sessions*`
  WHERE
    1 = 1
    AND PARSE_TIMESTAMP('%Y%m%d', REGEXP_EXTRACT(_table_suffix, r'.*_(.*)')) BETWEEN TIMESTAMP('2017-04-01')
    AND TIMESTAMP('2017-04-30')
  GROUP BY
    fv,
    v,
    date )
GROUP BY
  f,
  date
ORDER BY
  date,
  freq DESC

And we found that for 3 of our projects it started on day 18 April but in accounts related to LATAM data we started seeing duplicated rows just recently as well.

We also checked if in our GCP Console something was logged but couldn't find anything.

Is there some mistake we could have made that caused the duplication in the ga_sessions export? We checked our analytics tracking but it seems to be working just fine. Also there's no modification we did these days that explain it as well.

If you need more info please let me know.

Upvotes: 1

Views: 1339

Answers (1)

Elliott Brossard
Elliott Brossard

Reputation: 33705

Make sure to match only the intraday or non-intraday tables. For intraday:

`dafiti-analytics.40663402.ga_sessions_intraday*`

For non-intraday:

`dafiti-analytics.40663402.ga_sessions_2017*`

The important part is to include enough of the prefix to match the desired tables.

Upvotes: 2

Related Questions