Reputation: 424
I work with anonymous data, where the destination may have been spelt incorrectly (I only observe an anonymized key for both the destination and origin, but I know the origin is correct).
origin<-c("Norway","Norway","Sweden","Sweden")
destination_typed<-c("Germany","Gerrmany","Spain","Spaiin")
df<-data.frame(origin=origin,destination=destination_typed)
df
I also have data on string similarity of the destinations. Again, I observe only anonymous keys for countries, and the score (how similar they are). So I do not know what is the correct spelling, i.e. I am just as happy with Spaiin as Spain, as long as they are grouped (dest_key_for_spain).
library(dplyr)
df_names<-expand.grid(destination_typed=destination_typed,
destination_alternatives=destination_typed,
stringsAsFactors = F) %>%
arrange(destination_typed) %>%
mutate(similarity_score=stringdist::stringsim(destination_typed,
destination_alternatives))
df_names
What I want is the anonymized destinations to be grouped together (e.g. if the similarity score is >0.5), i.e.:
df_wanted<-data.frame(origin=c("Norway","Sweden"), destination=c("dest_key_for_germany","dest_key_for_spain"))
df_wanted
Update: Since I actually have anonymous data, the data actually looks like this:
# using anonymized data:
df$destination[df$destination=="Germany"]<-"###123A"
df$destination[df$destination=="Gerrmany"]<-"#KL237#"
df_names$destination_typed[df_names$destination_typed=="Germany"]<-"###123A"
df_names$destination_typed[df_names$destination_typed=="Gerrmany"]<-"#KL237#"
df_names$destination_alternatives[df_names$destination_alternatives=="Germany"]<-"###123A"
df_names$destination_alternatives[df_names$destination_alternatives=="Gerrmany"]<-"#KL237#"
df$destination[df$destination=="Spain"]<-"##957KA"
df$destination[df$destination=="Spaiin"]<-"KLU##ab"
df_names$destination_typed[df_names$destination_typed=="Spain"]<-"##957KA"
df_names$destination_typed[df_names$destination_typed=="Spaiin"]<-"KLU##ab"
df_names$destination_alternatives[df_names$destination_alternatives=="Spain"]<-"##957KA"
df_names$destination_alternatives[df_names$destination_alternatives=="Spaiin"]<-"KLU##ab"
df
df_names
Upvotes: 0
Views: 62
Reputation: 424
The solution is just too simple. Actually, it was likely so simple and so badly explained by me (my first post at SO) that it was hard to understand what I wanted, so sorry for that! I only sorted the data first and then took the first similar (anonymized) country name that had a similarity score above a certain level. Using the anonymized data above:
df_names<-df_names %>%
arrange(destination_typed,destination_alternatives) %>%
filter(similarity_score>0.5) %>%
filter(!duplicated(destination_typed))
df %>%
left_join(df_names,by=c("destination"="destination_typed")) %>%
mutate(destination=destination_alternatives) %>%
select(-destination_alternatives,-similarity_score) %>%
distinct()
# which is essentially the same as I wanted:
df_wanted
Upvotes: 1
Reputation: 851
Here's an approach using Levenshtein distance for two strings. Everything would be much easier if you have a vector of possible destinations to check against. Otherwise you'll need to compute pairwise similarity between all typed destinations, and this will get hairy.
library(RecordLinkage)
library(tidyverse)
destination_typed<-c("Germany","Gerrmany","Spain","Spaiin")
destination_groups<-c("Germany","Spain")
tibble(destination_typed) %>%
mutate(group = map_chr(destination_typed,
~ Vectorize(RecordLinkage::levenshteinDist, vectorize.args = "str2")(.x, destination_groups) %>%
(function(X)destination_groups[which(X == min(X))]))
)
Upvotes: 0