Reputation: 11777
We have been trying to explain why this happened in all of our datasets but so far we had no success.
We observed that starting on 18 April our ga_sessions dataset had for the most part duplicated entries (like 99% of rows). As an example, I tested this query:
SELECT
fullvisitorid fv,
visitid v,
ARRAY(
SELECT
AS STRUCT hits.*
FROM
UNNEST(hits) hits
ORDER BY
hits.hitnumber) h
FROM
`dafiti-analytics.40663402.ga_sessions*`
WHERE
1 = 1
AND REGEXP_EXTRACT(_table_suffix, r'.*_(.*)') BETWEEN FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))AND FORMAT_DATE("%Y%m%d", DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
ORDER BY
fv,
v
LIMIT
100
And the result was:
We tried to investigate when this began to happen, so I ran this query:
SELECT
date,
f,
COUNT(f) freq from(
SELECT
date,
fullvisitorid fv,
visitid v,
COUNT(CONCAT(fullvisitorid, CAST(visitid AS string))) f
FROM
`dafiti-analytics.40663402.ga_sessions*`
WHERE
1 = 1
AND PARSE_TIMESTAMP('%Y%m%d', REGEXP_EXTRACT(_table_suffix, r'.*_(.*)')) BETWEEN TIMESTAMP('2017-04-01')
AND TIMESTAMP('2017-04-30')
GROUP BY
fv,
v,
date )
GROUP BY
f,
date
ORDER BY
date,
freq DESC
And we found that for 3 of our projects it started on day 18 April but in accounts related to LATAM data we started seeing duplicated rows just recently as well.
We also checked if in our GCP Console something was logged but couldn't find anything.
Is there some mistake we could have made that caused the duplication in the ga_sessions export? We checked our analytics tracking but it seems to be working just fine. Also there's no modification we did these days that explain it as well.
If you need more info please let me know.
Upvotes: 1
Views: 1339
Reputation: 33705
Make sure to match only the intraday or non-intraday tables. For intraday:
`dafiti-analytics.40663402.ga_sessions_intraday*`
For non-intraday:
`dafiti-analytics.40663402.ga_sessions_2017*`
The important part is to include enough of the prefix to match the desired tables.
Upvotes: 2