Riccardo Lamera
Riccardo Lamera

Reputation: 105

Databricks sql, how to compare string escaping characters

I'm analysing two columns to filter in records that don't look alike:

select firstname, name 
from tableA
where lower(firstname) != lower(name)

I accept names that match even if few special characters don't, for instance:

enter image description here

These two names to Databricks are deemed to be different (rightly so) but I want to escape that '-' charachter. Is there a way to say 'look for different names without considering the '-' character'? Or maybe a way to get a index on how much the two strings differ.

Upvotes: 1

Views: 1032

Answers (1)

Alex Ott
Alex Ott

Reputation: 87249

There are several approaches for that:

  • Do normalization of strings before comparison - lower or upper case the strings, replace all characters like -, multiple spaces, etc. with single space - you can use regexp_replace function for that. Something like this:
select firstname, name 
from tableA
where 
  regex_replace(lower(firstname), '[- ]+', ' ') != 
    regex_replace(lower(name), '[- ]+', ' ')

It could be further simplified by creating a user-defined SQL function for that normalization.

select firstname, name 
from tableA
where levenstein(lower(firstname), lower(name)) != 0

Upvotes: 1

Related Questions