Kjetil Haukås
Kjetil Haukås

Reputation: 424

String similarity grouping with anonymous data

I work with anonymous data, where the destination may have been spelt incorrectly (I only observe an anonymized key for both the destination and origin, but I know the origin is correct).

origin<-c("Norway","Norway","Sweden","Sweden")
destination_typed<-c("Germany","Gerrmany","Spain","Spaiin")
df<-data.frame(origin=origin,destination=destination_typed)
df

I also have data on string similarity of the destinations. Again, I observe only anonymous keys for countries, and the score (how similar they are). So I do not know what is the correct spelling, i.e. I am just as happy with Spaiin as Spain, as long as they are grouped (dest_key_for_spain).

library(dplyr)
df_names<-expand.grid(destination_typed=destination_typed,
            destination_alternatives=destination_typed,
            stringsAsFactors = F) %>% 
  arrange(destination_typed) %>% 
  mutate(similarity_score=stringdist::stringsim(destination_typed,
                                                destination_alternatives))
df_names

What I want is the anonymized destinations to be grouped together (e.g. if the similarity score is >0.5), i.e.:

df_wanted<-data.frame(origin=c("Norway","Sweden"),                 destination=c("dest_key_for_germany","dest_key_for_spain"))
df_wanted

Update: Since I actually have anonymous data, the data actually looks like this:

# using anonymized data:
df$destination[df$destination=="Germany"]<-"###123A"
df$destination[df$destination=="Gerrmany"]<-"#KL237#"
df_names$destination_typed[df_names$destination_typed=="Germany"]<-"###123A"
df_names$destination_typed[df_names$destination_typed=="Gerrmany"]<-"#KL237#"
df_names$destination_alternatives[df_names$destination_alternatives=="Germany"]<-"###123A"
df_names$destination_alternatives[df_names$destination_alternatives=="Gerrmany"]<-"#KL237#"
df$destination[df$destination=="Spain"]<-"##957KA"
df$destination[df$destination=="Spaiin"]<-"KLU##ab"
df_names$destination_typed[df_names$destination_typed=="Spain"]<-"##957KA"
df_names$destination_typed[df_names$destination_typed=="Spaiin"]<-"KLU##ab"
df_names$destination_alternatives[df_names$destination_alternatives=="Spain"]<-"##957KA"
df_names$destination_alternatives[df_names$destination_alternatives=="Spaiin"]<-"KLU##ab"

df
df_names

Upvotes: 0

Views: 62

Answers (2)

Kjetil Hauk&#229;s
Kjetil Hauk&#229;s

Reputation: 424

The solution is just too simple. Actually, it was likely so simple and so badly explained by me (my first post at SO) that it was hard to understand what I wanted, so sorry for that! I only sorted the data first and then took the first similar (anonymized) country name that had a similarity score above a certain level. Using the anonymized data above:

df_names<-df_names %>% 
  arrange(destination_typed,destination_alternatives) %>% 
  filter(similarity_score>0.5) %>% 
  filter(!duplicated(destination_typed))

df %>% 
  left_join(df_names,by=c("destination"="destination_typed")) %>% 
  mutate(destination=destination_alternatives) %>% 
  select(-destination_alternatives,-similarity_score) %>% 
  distinct()

# which is essentially the same as I wanted:

df_wanted

Upvotes: 1

BrianLang
BrianLang

Reputation: 851

Here's an approach using Levenshtein distance for two strings. Everything would be much easier if you have a vector of possible destinations to check against. Otherwise you'll need to compute pairwise similarity between all typed destinations, and this will get hairy.

library(RecordLinkage)
library(tidyverse)

destination_typed<-c("Germany","Gerrmany","Spain","Spaiin")
destination_groups<-c("Germany","Spain")

tibble(destination_typed) %>%
  mutate(group = map_chr(destination_typed,
                     ~ Vectorize(RecordLinkage::levenshteinDist, vectorize.args = "str2")(.x, destination_groups) %>%
                       (function(X)destination_groups[which(X == min(X))]))
)

Upvotes: 0

Related Questions