Zolthar Aje
Zolthar Aje

Reputation: 33

How to find similar string value in Bigquery

I have two tables in my database, each table has column with names. How to compare these tables columns and how to find these names what has exact matches and these names, which are similar in table1 and table2?

For example :

Table 1

AHMAD JAN AKHUNDZADA SHUKOOR AKHUNDZADA alias AHMAD JAN AKHUNZADA alias AHMAD JAN AKHUND ZADA  
Khatiba Imam Al-Bukhari (KIB)
MOHAMMAD SADIQ alias AMIR MOHAMMAD
Fuad

Table 2 :

Fuad
Khatib Imam Al-Bukhari
Khabiba Imam Al - Bukhari
ahmad jan Akhunzada shukoor akhundzada
AHMAD JAN AKHUNZADA
AHMAD JAN AKHUND ZADA 
AMIR MOHAMMAD
MOHAMMAD SADIQ

Result should come out with this:

Table 1 --> Table 2

MOHAMMAD SADIQ alias AMIR MOHAMMAD --> AMIR MOHAMMAD
MOHAMMAD SADIQ alias AMIR MOHAMMAD --> MOHAMMAD SADIQ
AHMAD JAN AKHUNDZADA SHUKOOR AKHUNDZADA alias AHMAD JAN AKHUNZADA alias AHMAD JAN AKHUND ZADA  --> AHMAD JAN AKHUNZADA
AHMAD JAN AKHUNDZADA SHUKOOR AKHUNDZADA alias AHMAD JAN AKHUNZADA alias AHMAD JAN AKHUND ZADA --> ahmad jan Akhunzada shukoor akhundzada
AHMAD JAN AKHUNDZADA SHUKOOR AKHUNDZADA alias AHMAD JAN AKHUNZADA alias AHMAD JAN AKHUND ZADA --> AHMAD JAN AKHUND Z
Khatiba Imam Al-Bukhari (KIB) --> Khatib Imam Al-Bukhari

how to effectively find similar names?

Upvotes: 2

Views: 3919

Answers (3)

Kabilan Mohanraj
Kabilan Mohanraj

Reputation: 1906

As I understand, there are some names in both tables that are not exactly the same and vary by some number of characters. For instance, there is a character d missing in Akhun(d)zada but present in AKHUNDZADA.

To find similarity between 2 strings, the Jaccard distance or the Levenshtein distance UDFs can be used. The jaccard() and levenshtein() functions are built-in UDFs (community provided) in BigQuery. Other UDFs available in BigQuery can be found here.

Consider the below query for your use case. jaccard() UDF has been used in this query.

SELECT distinct table_1.Name, table_2.Name as Name_table_2
FROM (
    SELECT Name, trim(name_unnest) as name_trim
    FROM 
    `project-id.dataset-id.table-1`, unnest(split(Name, "alias")) as name_unnest 
) as table_1, `project-id.dataset-id.table-2` as table_2 
where bqutil.fn.jaccard(lower(table_1.name_trim), lower(table_2.Name))>=0.8
order by Name;

Output of the query:

enter image description here

Also, please note that for the given set of inputs, a threshold of 0.8 works well. The threshold might not hold good for other inputs and has to be adjusted accordingly. The lower the threshold, the more distinct the names will become and lesser is the similarity.

Upvotes: 3

Thorsten Kettner
Thorsten Kettner

Reputation: 94914

You are suffering from a bad data model. The string in table1 can hold multiple values, thus violating database normalization.

To go about this, you can split the string into its separate values. In your case, however, you need two steps for this:

  • Split the strings at the key word "alias"
  • For each word containing additional information in parentheses, add a string without that information. (E.g. for 'Khatiba Imam Al-Bukhari (KIB)' you want to add 'Khatiba Imam Al-Bukhari' to your list of search strings.)

Once you have done this, you want to compare regardless of upper/lower case (so that 'AHMAD JAN AKHUNZADA SHUKOOR AKHUNDZADA' matches 'ahmad jan Akhunzada shukoor akhundzada').

In Google BigQuery you get from a list in a string to a proper array with SPLIT. Then, you can use UNNEST to get from the array to single rows with one name each. For removing text in parantheses, I'd use REGEXP_REPLACE. And for a case insensitive comparison you can convert all characters to lower case.

with raw_names as
(
  select
    names as names_string,
    trim(raw_name) as name
  from table1
  cross join unnest(lower(names), 'alias') as raw_name
)
, names as
(
  select
    names_string, trim(raw_name) as name
  from raw_names
  union
  select
    names_string, trim(regexp_replace(raw_name, '\([^\)]*\)', '')) as name
  from raw_names
)
select names.names_string, table2.name
from names
join table2 on lower(table2.name) = names.name
order by names.names_string, table2.name;

This is untested. I have never worked with BigQuery, so I hope that I got everything right.

Upvotes: 0

Sathiya
Sathiya

Reputation: 239

Try the SQL joins to compare the two tables.

SELECT Table_1.Name FROM Table_1
LEFT JOIN Table_2 ON Table_1.Name = Table_2.Name;

Upvotes: -1

Related Questions