Roman
Roman

Reputation: 9441

BigQuery join on similar keys

I have keys that may change slightly in both tables, because they were both independently entered via free text. Any human eye would spot that they are the same, but the tables are too big to created regular expression rules for each type of mismatch.

For example, these should match:
This Is Key One and This is Key one
Subject:the Description and Subject: the Description
I am key two! and I am key two

How do I create such a query? (if it's even possible in BQ)

Upvotes: 1

Views: 328

Answers (2)

Zig Mandel
Zig Mandel

Reputation: 19835

first you need to identify every single possible way that your keys can be similar. for example one is ignoring case, another is removing special characters like "!". you can make a user-defined function that handles all such cases to "clean" or "standarize" the keys.

create a new clean table using the custom function and query on the clean table (and always clean the keys before appending more rows from then on)

https://cloud.google.com/bigquery/user-defined-functions

Upvotes: 1

Dan Delorey
Dan Delorey

Reputation: 161

An alternative to using a user-defined function as Zig suggests would be to use the various BigQuery string functions to clean your data.

Whatever method you use, though, you'll need to do it either in a subquery prior to the join or you'll have to do it as a preliminary step before your query. BigQuery at this time only support join predicates which are conjunctions of equality comparisons (i.e., a.key1 = b.key1 AND a.key2 = b.key2 etc.).

Upvotes: 3

Related Questions