How can I group strings based on their similarity?

Question

I have data like this:

df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L, 
    9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway", 
    " USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia", 
    "indiaAfghanestan ", "USA", "USAargentina "), class = "factor"), 
        value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L, 
        8L), .Label = c("1941029507", "2367321518", "2849255881", 
        "2913128511", "2927576083", "4550996370", "457707181.9", 
        "637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label", 
    "value"), class = "data.frame", row.names = c(NA, -10L))

I want to take the longest name (by number of letters), find the shorter names that are similar to it, and assign them all to one group.

Then I want to move on to the next longest name and assign its matches to another group, and so on until no names are left.

First I calculate the length of each label:

library(dplyr)
dft <- data.frame(names=df$label,chr=apply(df,2,nchar)[,1])
colnames(dft)[1] <- "label"
df2 <- inner_join(df, dft)

Now I can find the longest string:

df2[which.max(df2$chr),]

Now I need to see which other strings are made of the same letters as this longest string. The longest string is

Afghanestankabolindia

so a similar string can be

A
Af
Afg
Afgh
Afgha
Afghan
Afghane
.
.
.

All such combinations are possible, but the order of the letters must stay the same (from left to right). For example, it can be Afgh but not fAhg.

So only two other strings are similar to this one:

Afghanestan
Afghanestankabol

This is because a string must match the longest string exactly, without a single letter that the longest string does not have, to be assigned to the same group.

The desired output is as follows:

label                     value     group
Afghanestan              2927576083     1
Afghanestankabol         2913128511     1
Afghanestankabolindia    1941029507     1
indiaAfghanestan        796495286.2     2
 Holandnorway           457707181.9     3
 holand                 89291651.19     3
 holandindia            4550996370      3
USA                     2849255881      4
USAargentina            2367321518      4
USAargentinabrazil      637943892.6     4

Why is indiaAfghanestan a separate group? Because it is not completely contained in another name; it only shares parts with one name or another. To join a group, a name has to be part of a bigger name.

I tried the approach from "Find similar strings and reconcile them within one dataframe", which did not help me at all.

I found something else that might help:

require("Biostrings")
pairwiseAlignment(df2$label[3], df2$label[1], gapOpening=0, gapExtension=4,type="overlap")

But I still don't know how to assign the strings to groups.

lukeA · Accepted Answer

You could try

library(magrittr)
df$label %>% 
  tolower %>% 
  trimws %>% 
  stringdist::stringdistmatrix(method = "jw", p = 0.1) %>% 
  as.dist %>% 
  `attr<-`("Labels", df$label) %>% 
  hclust %T>% 
  plot %T>% 
  rect.hclust(h = 0.3) %>% 
  cutree(h = 0.3) %>% 
  print -> df$group

df
# label       value group
# 1           Afghanestan   2927576083     1
# 2       Afghanestankabol  2913128511     1
# 3  Afghanestankabolindia  1941029507     1
# 4      indiaAfghanestan  796495286.2     2
# 5           Holandnorway 457707181.9     3
# 6                 holand 89291651.19     3
# 7            holandindia  4550996370     3
# 8                    USA  2849255881     4
# 9          USAargentina   2367321518     4
# 10    USAargentinabrazil 637943892.6     4
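
For readers less used to magrittr's tee operator, the pipeline in the answer can be unrolled into separate steps. This is only a sketch of the same calls without the plotting; the `labels` vector is retyped from the question's data, and the names `clean`, `d`, `hc` and `groups` are illustrative rather than part of the original answer.

```r
library(stringdist)

# Labels in the row order of the question's data frame (stray spaces kept)
labels <- c("Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
            "indiaAfghanestan ", " Holandnorway", " holand", " holandindia",
            "USA", "USAargentina ", " USAargentinabrazil")

# 1. Normalise case and whitespace so they do not count as differences
clean <- trimws(tolower(labels))

# 2. Pairwise Jaro-Winkler distances, with the same p = 0.1 as in the answer;
#    with a single input vector, stringdistmatrix() returns a "dist" object
d <- stringdistmatrix(clean, method = "jw", p = 0.1)

# 3. Hierarchical clustering on those distances (hclust's default linkage)
hc <- hclust(d)

# 4. Cut the tree at height 0.3, the threshold chosen in the answer
groups <- cutree(hc, h = 0.3)

data.frame(label = labels, group = groups)
```

The cut height is a tuning choice: a smaller `h` splits the labels into more groups, a larger one merges them, so it is worth checking the dendrogram before settling on a value for other data.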

See ?stringdist::'stringdist-metrics' for an overview of the string dissimilarity measures offered by stringdist.
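
If a distance threshold feels too loose, the rule from the question can also be approximated with exact prefix matching in base R. The sketch below is one possible reading of that rule, not the method used above: it assumes that case and surrounding whitespace should be ignored, and it puts each label in the group of the shortest label that is a prefix of it. The helper `find_root` and the `labels` vector (retyped from the question's data) are illustrative names.

```r
# Labels in the row order of the question's data frame (stray spaces kept)
labels <- c("Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
            "indiaAfghanestan ", " Holandnorway", " holand", " holandindia",
            "USA", "USAargentina ", " USAargentinabrazil")

# Assumption: differences in case and surrounding whitespace are noise
clean <- tolower(trimws(labels))

# The "root" of a label is the shortest label that is a prefix of it.
# Every label is a prefix of itself, so there is always at least one hit.
find_root <- function(x, candidates) {
  hits <- candidates[startsWith(x, candidates)]
  hits[which.min(nchar(hits))]
}

roots <- vapply(clean, find_root, character(1),
                candidates = clean, USE.NAMES = FALSE)

# Number the groups in order of first appearance
group <- match(roots, unique(roots))

data.frame(label = labels, group = group)
```

Under these assumptions indiaAfghanestan ends up in a group of its own, because no other label is a prefix of it, and the three holand labels share the root holand. A label that only shares a middle or trailing part with another name is never merged, which is stricter than the Jaro-Winkler clustering.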
