Ajay
Ajay

Reputation: 1

SAS Fuzzy Lookup

I'm trying to do a fuzzy lookup on two datasets in SAS. I have searched over google and found the below link which explains the process of doing the fuzzy lookup in SAS. Link: http://blogs.sas.com/content/sgf/2015/01/27/how-to-perform-a-fuzzy-match-using-sas-functions/

To explain in detail the problem, the two datasets contains information of Hospital names and other additional information. I have to match both the data sets based on Hospital names. But the main challenge is in some cases I have the hospital name as follows:

Dataset1(hospital Name): St.Hospital Dataset2(hospital Name): Saint.Hospital

Like wise INC and Incorporated.

I would like to know is there any best way to do the fuzzy lookup in SAS.

Thanks, VJ

Upvotes: 0

Views: 309

Answers (1)

Chris Long
Chris Long

Reputation: 1319

There can't be any single best way to do a fuzzy lookup, as the article you linked to explains. You have to decide on the best approach for your particular problem domain and your particular tolerances for false positives and false negatives, etc.

For your data, I would probably just define a set of 'best guess' transformations on the hospital name in both input data sets, and then do a standard merge on the transformed names. The transformations would be something like:

  • Convert to uppercase
  • Convert 'ST.' or 'ST ' to 'SAINT' (or should that be 'STREET'??)
  • Convert 'INC' or 'INC.' to 'INCORPORATED'
  • Convert any other known common strings as above
  • Remove any remaining punctuation
  • Use COMPBL to reduce multiple spaces to a single space
  • Do the merge

You will then have to examine the result and decide if it's good enough for your purposes. There is no general way for a computer to match up two strings that might be arbitrarily badly-spelled, particularly if there are multiple possible 'correct' matches - this is the same problem that spell-checkers have been trying to solve for decades - there's no way of knowing (in isolation) whether a misspelled word like 'falt' was meant to be 'fault', 'fall', 'fast', 'fat' etc.

If your results have to be perfect, you will need a human to review anything that isn't an exact match, and even then some of the exact matches might be misspellings that happen to match another hospital's name (eg, 'Saint Mary's Hospital' vs 'Saint May's Hospital'). That's why the preferred approach would usually be to identify the hospital by an ID number and the name, rather than just the name.

Upvotes: 0

Related Questions