Getting approximately unique values from a character vector

Question

Identifying unique values is straightforward when the data is well behaved. Here I am looking for an approach to get a list of approximately unique values from a character vector.

Let x be a vector with slightly different names for the same entity, e.g. "Kentucky loader" may appear as "Kentucky load", "Kentucky loader (additional info)", or something similar.

x <- c("Kentucky load",
       "Kentucky loader (additional info)",
       "CarPark Gifhorn (EAP)",
       "Car Park  Gifhorn (EAP) new 1.5.2012",
       "Center Kassel (neu 01.01.2014)",
       "HLLS Bremen (EAP)",
       "HLLS Bremen (EAP) new 06.2013",
       "Hamburg total sum (abc + TBL)",
       "Hamburg total (abc + TBL) new 2012")

What I want to get out is something like:

c("Kentucky loader",
  "Car Park Gifhorn (EAP)",
  "Center Kassel (neu 01.01.2014)",
  "HLLS Bremen (EAP)",
  "Hamburg total (abc + TBL)")

Idea

Calculate some similarity measure between all strings (e.g. Levenshtein distance)
Use a longest common substring (or subsequence) method
Somehow :( decide which strings belong together based on this information.
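
As a rough sketch of the first idea only: base R's adist() (from the utils package) computes generalized Levenshtein distances between all pairs of strings. The short vector y below is a hypothetical stand-in for x, used just to keep the matrix readable; how to turn these distances into groups is still the open part.

```r
# Hypothetical short vector standing in for x
y <- c("Kentucky load", "Kentucky loader", "HLLS Bremen (EAP)")

# Pairwise generalized Levenshtein (edit) distances
d <- adist(y)
dimnames(d) <- list(y, y)

# Small values mean near-duplicates, e.g. d[1, 2] is 2 (append "e" and "r")
d
```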

But I guess this will be a standard task (for those R users working with "dirty" data regularly), so I assume there will be a set of standard approaches to it.

Does someone have a hint or is there a package that does this?

jeremycg · Accepted Answer

As @Jaap said, try playing with OpenRefine. The Data Carpentry course on it is pretty good.

If you do want to stay in R, here's a solution for your example, using agrepl:

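# For each string (used as pattern), flag the elements of x that approximately match it
z <- sapply(x, function(p) agrepl(p, x, max.distance = 0.2))
# For each element of x, keep the shortest of the strings that match it
apply(z, 1, function(hit) x[hit][which.min(nchar(x[hit]))])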

This returns, for each element of x, the shortest matching string (by number of characters):

[1] "Kentucky load"                  "Kentucky load"                  "CarPark Gifhorn (EAP)"         
[4] "CarPark Gifhorn (EAP)"          "Center Kassel (neu 01.01.2014)" "HLLS Bremen (EAP)"             
[7] "HLLS Bremen (EAP)"              "Hamburg total sum (abc + TBL)"  "Hamburg total sum (abc + TBL)"

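This is useful if you want to keep the order of your vector so that it lines up with other vectors (or to use it on a column of a data frame).

Calling unique on this output gives the deduplicated values. Note that it keeps the shortest variant of each group, e.g. "Kentucky load" rather than "Kentucky loader".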
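
Putting the two steps and the unique call together, here is a minimal sketch. The function name approx_unique and its max_dist argument are made up for illustration; 0.2 is simply the threshold that worked for this example, so expect to tune it for other data.

```r
x <- c("Kentucky load",
       "Kentucky loader (additional info)",
       "CarPark Gifhorn (EAP)",
       "Car Park  Gifhorn (EAP) new 1.5.2012",
       "Center Kassel (neu 01.01.2014)",
       "HLLS Bremen (EAP)",
       "HLLS Bremen (EAP) new 06.2013",
       "Hamburg total sum (abc + TBL)",
       "Hamburg total (abc + TBL) new 2012")

# Hypothetical helper: map each string to the shortest string that
# approximately matches it, then drop the duplicates
approx_unique <- function(x, max_dist = 0.2) {
  # Column j holds the matches of pattern x[j] against every element of x
  hits <- sapply(x, function(p) agrepl(p, x, max.distance = max_dist))
  # Row i lists the patterns that matched x[i]; keep the shortest one
  shortest <- apply(hits, 1, function(hit) x[hit][which.min(nchar(x[hit]))])
  unique(unname(shortest))
}

res <- approx_unique(x)
res
```

Because each string always matches itself, every element of the result is one of the original strings, never a newly constructed name.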
