Reputation: 501
I have a table with people and their hometown names, but there are same cities with different written, see:
Name | Hometown |
---|---|
João | São Paulo |
Maria | Sao Paul |
Pedro | São Paulo. |
Maria | S. Paulo |
And I need to process this in order to formalize that data to be like this:
Name | Hometown |
---|---|
João | São Paulo |
Maria | São Paulo |
Pedro | São Paulo |
Maria | São Paulo |
I tried this stack and would it be exactly what I need but does not work with my entire dataset.
Upvotes: 2
Views: 5775
Reputation: 164679
First, the basics.
The first one is straight-forward, strip out everything which isn't a letter so São Paulo
and São Paulo.
are both SãoPaulo
.
Case folding is also straight-forward, change everything to lower or upper case. são paulo
and São Paulo
compare the same.
Finally, convert them to the normal ASCII equivalents. For example, são
becomes sao
.
With this normalization done, the issues of spaces, extra characters, accents, and cases are taken are of. I would recommend doing this outside of BigQuery and in a language like Python. Do a select distinct
and transform and compare each value using libraries such as unidecode.
You can then employ some heuristics to try and find "close enough" matches. One example is the Levenshtein distance which is the number of substitutions, insertions, and deletions one needs to do to turn one string into another. Python has a Levenshtein library.
For example, Sao Paul
and Sao Paulo
have a Levenshtein distance of one; add one letter. S Paulo
and Sao Paulo
have a Levenshtein distance of two, add two letters. Sao Paulo
and Saint Paul
have a Levenshtein distance of four; change o to i, add n and t, remove o.
Again, I'd recommend doing this with a regular programming language and then writing the normalized results back to BigQuery.
Upvotes: 2
Reputation: 172974
Consider below approach (considering you have lookup table with all proper cities names) for purpose of example - I have it as CTE with just few ones
with cities as (
select 'São Paulo' as city union all
select 'Los Angeles' union all
select 'Dnipro' union all
select 'Kyiv'
)
select Name, City as Hometown
from your_table
left join cities
on soundex(Hometown) = soundex(city)
if applied to sample data in your question - output is
Note: you obviously need to take care of potential duplication in case if some cities sounds similar, in this case adding country constraints might help ...
Upvotes: 2