Reputation: 11431
Identifying unique values is straightforward when the data is well behaved. Here I am looking for an approach to get a list of approximately unique values from a character vector.
Let x be a vector with slightly different names for an entity, e.g. "Kentucky loader" may appear as "Kentucky load" or "Kentucky loader (additional info)" or something similar.
x <- c("Kentucky load",
"Kentucky loader (additional info)",
"CarPark Gifhorn (EAP)",
"Car Park Gifhorn (EAP) new 1.5.2012",
"Center Kassel (neu 01.01.2014)",
"HLLS Bremen (EAP)",
"HLLS Bremen (EAP) new 06.2013",
"Hamburg total sum (abc + TBL)",
"Hamburg total (abc + TBL) new 2012")
What I want to get out is something like:
c("Kentucky loader",
"Car Park Gifhorn (EAP)",
"Center Kassel (neu 01.01.2014)",
"HLLS Bremen (EAP)",
"Hamburg total (abc + TBL)")
Idea
But I guess this is a standard task (for R users who regularly work with "dirty" data), so I assume there is a set of standard approaches to it.
Does anyone have a hint, or is there a package that does this?
Upvotes: 3
Views: 159
Reputation: 24945
As @Jaap said, try playing with OpenRefine. The Data Carpentry course is pretty good.
If you do want to stay in R, here's a solution for your example, using agrepl:
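# z[i, j] is TRUE when x[j] approximately matches somewhere within x[i]
z <- sapply(x, function(pattern) agrepl(pattern, x, max.distance = 0.2))
# for each element of x, keep the shortest of its approximate matches
apply(z, 1, function(myz) x[myz][which.min(nchar(x[myz]))])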
This gives, for each member of x, the shortest (in characters) of its matches:
[1] "Kentucky load" "Kentucky load" "CarPark Gifhorn (EAP)"
[4] "CarPark Gifhorn (EAP)" "Center Kassel (neu 01.01.2014)" "HLLS Bremen (EAP)"
[7] "HLLS Bremen (EAP)" "Hamburg total sum (abc + TBL)" "Hamburg total sum (abc + TBL)"
This is useful if you want to keep the order of your vector so that it lines up with others (or to use it on a column of a data frame).
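As a rough sketch of the data frame case: because the result has one entry per input element, it can be stored next to the original names and used for grouping. The data frame df and its columns name, value and name_clean are made up for illustration, and the 0.2 threshold is just the one used above.

```r
# Hypothetical data: `df`, `name` and `value` are invented for this example
df <- data.frame(
  name = c("Kentucky load",
           "Kentucky loader (additional info)",
           "HLLS Bremen (EAP)",
           "HLLS Bremen (EAP) new 06.2013"),
  value = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Same two steps as above, applied to the column
z <- sapply(df$name, function(pattern) agrepl(pattern, df$name, max.distance = 0.2))
df$name_clean <- apply(z, 1, function(myz) df$name[myz][which.min(nchar(df$name[myz]))])

# Aggregate on the cleaned name instead of the raw one
totals <- aggregate(value ~ name_clean, data = df, FUN = sum)
print(totals)
```

Keeping the raw name column alongside name_clean makes it easy to check by eye which entries were merged.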
You can call unique on this output to get a list of approximately unique values.
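Putting the two steps and the final unique call together, a minimal sketch could wrap them in a helper. The function name approx_unique and its max_distance argument are my own labels, not something from a package, and the default of 0.2 is simply the threshold used above; other data will likely need tuning.

```r
# Hypothetical helper: `approx_unique` is an illustrative name, not a package function
approx_unique <- function(x, max_distance = 0.2) {
  # hits[i, j] is TRUE when x[j] approximately matches somewhere inside x[i]
  hits <- vapply(x, function(pattern) agrepl(pattern, x, max.distance = max_distance),
                 logical(length(x)))
  hits <- matrix(hits, nrow = length(x))  # keep a matrix even for length-1 input
  # shortest approximate match for each element, in the same order as x
  canonical <- vapply(seq_along(x), function(i) {
    candidates <- x[hits[i, ]]
    candidates[which.min(nchar(candidates))]
  }, character(1))
  unique(canonical)
}

x <- c("Kentucky load",
       "Kentucky loader (additional info)",
       "CarPark Gifhorn (EAP)",
       "Car Park Gifhorn (EAP) new 1.5.2012",
       "Center Kassel (neu 01.01.2014)",
       "HLLS Bremen (EAP)",
       "HLLS Bremen (EAP) new 06.2013",
       "Hamburg total sum (abc + TBL)",
       "Hamburg total (abc + TBL) new 2012")

res <- approx_unique(x)
print(res)
```

Note that this picks the shortest variant as the representative of each group, so you get "Kentucky load" rather than "Kentucky loader"; choosing a nicer label would need an extra rule.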
Upvotes: 2