Vesa Palonen
Vesa Palonen

Reputation: 282

Firebase events dedup in Big Query - best practices?

There seems to be 1-2% of duplicates in the Firebase analytics events exported to Big Query. What are the best practices to remove these?

Atm the client does not send a counter with the events (per session). This would provide an unambiguous way of removing duplicate events, so I recommend Firebase implementing that. However, at the moment, what would be a good way to remove the duplicates? Look at client user_pseudo_id, event_timestamp, and event_name - fields and remove all except one with same triple?

How does event_bundle_sequence_id -field work? Will duplicates have the same value in this field, or different? That is, are duplicate events sent within the same bundle, or in different bundles?

Is Firebase planning to remove these duplicates earlier in the processing, either for Firebase analytics itself, or in the export to Big Query?

Standard SQL to check for duplicates in one days events:

with n_dups as
(
SELECT event_name, event_timestamp, user_pseudo_id, count(1)-1 as n_duplicates
FROM `project.dataset.events_20190610`
group by event_name, event_timestamp, user_pseudo_id
)
select n_duplicates, count(1) as n_cases
from n_dups
group by n_duplicates
order by n_cases desc

Upvotes: 18

Views: 1756

Answers (1)

Maksym Skorupskyi
Maksym Skorupskyi

Reputation: 318

We use the QUALIFY clause for deduplication Firebase events in BigQuery:

SELECT
  *
FROM
  `project.dataset.events_*`
QUALIFY
  ROW_NUMBER() OVER (
    PARTITION BY
      user_pseudo_id,
      event_name,
      event_timestamp,
      TO_JSON_STRING(event_params)
    ) = 1

Qualifying columns:

  - name: user_pseudo_id
    description:  Autogenerated pseudonymous ID for the user -
                  Unique identifier for a specific installation of application on a client device,
                  e.g. "938642951.1666427135".
                  All events generated by that device will be tagged with this pseudonymous ID,
                  so that you can relate events from the same user together.

  - name: event_name
    description:  Event name, e.g. "app_launch", "session_start", "login", "logout" etc.

  - name: event_timestamp
    description:  The time (in microseconds, UTC) at which the event was logged on the client,
                  e.g. "1666529002225262".

  - name: event_params
    description:  A repeated record (ARRAY) of the parameters associated with this event.

Upvotes: 1

Related Questions