Noob_Strider
Noob_Strider

Reputation: 183

Removing "almost duplicates" using SAS or Excel

I am working in SAS and I have a data-set with 2 columns and I want not only to remove the duplicates, but also the "almost" duplicates. The data looks like this:

**Brand        Product**
Coca Cola    Coca Cola Light
Coca Cola    Coca Cola Lgt
Coca Cola    Cocacolalight
Coca Cola    Coca Cola Vanila
  Pepsi       Pepsi Zero
  Pepsi       Pepsi Zro

i do not know if it is actually possible, but what I would like the file to look like after removing the "duplicates", is like that:

    **Brand        Product**
    Coca Cola    Coca Cola Light
    Coca Cola    Coca Cola Vanila
      Pepsi       Pepsi Zero

I don't have a preference if the final table will have e.g. "Pepsi Zero" or "Pepsi Zro" as long as there are no "duplicate" values.

I was thinking if there was a way to compare the e.g. first 4-5 letters and if they are the same then to consider them as duplicates. But of course I am open to suggestions. If there is a way to be done even in excel I would be interested to hear it.

Upvotes: 4

Views: 683

Answers (1)

Robert Penridge
Robert Penridge

Reputation: 8523

I'm going to start by straight up quoting Jeff's answer :

SAS has at least a couple functions for calculating edit distance between two strings:

Compged, for general edit distance: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206133.htm

Complev, for Levenshtein distance: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206137.htm

There's also the spedis() function for comparing edit distances.

Now those are all great, but my personal favorite is the soundex() function which will allow you to test if two words 'sound' the same. It's not going to be 100% correct but in this case the results work alright.

First some data:

Data HAVE;
  attrib name length=$20 alt_name length=$20;
  infile datalines dsd dlm=',' truncover;
  input name $ alt_name $;
  datalines;
Coca Cola    ,Coca Cola Light
Coca Cola    ,Coca Cola Lgt
Coca Cola    ,Cocacolalight
Coca Cola    ,Coca Cola Vanila
Pepsi        ,Pepsi Zero
Pepsi        ,Pepsi Zro
;
Run;

Get every combination of words that we want to compare, and calculate the soundex()s for eyeballing:

proc sql noprint;
  create table cartesian as
  select a.name,
         a.alt_name as alt_name1,
         b.alt_name as alt_name2,
         soundex(a.alt_name) as soundex_a,
         soundex(b.alt_name) as soundex_b
  from have a, have b
  where a.name = b.name
    and soundex(a.alt_name) eq soundex(b.alt_name)
  ;
quit;

Now I'll leave it up to use as an exercise to dedupe the resulting list. But basically this will tell you which words match up. If you get false-positives for the matches, just add them to an exception list to manually transform those particular values.

Upvotes: 4

Related Questions