Cooper
Cooper

Reputation: 45

Fuzzy matching for slight difference

I have a column of company names and I would like to count how many different companies in that column. In this column, some identical companies have slight difference in their names, for example, these companies should be counted only once.

ASAHI INTECC CO., LTD.
Asahi Intecc USA Inc
ASAHI INTECC USA, INC

I want the codes that could work in general, which could precisely count the numbers of companies without counting the duplicates with slight difference. For example, this reproducible data should return a value of 6

company <- read.table(text = "
          CompanyName
          'MERCK SHARP & DOHME CORPORATION'
          'GILEAD SCIENCES INC'
          'BOEHRINGER INGELHEIM PHARMACEUTICALS, INC.'
          'ABBVIE, INC.'
          'JANSSEN SCIENTIFIC AFFAIRS, LLC'
          'BOEHRINGER INGELHEIM PHARMA GMBH & CO.KG'
          'ASAHI INTECC CO., LTD.'
          'Asahi Intecc USA Inc'
", header = TRUE, stringsAsFactors = FALSE)

I looked at How can I match fuzzy match strings from two datasets? But I still do not have an idea how to construct the codes. Hope for any advice

Upvotes: 1

Views: 538

Answers (1)

Yifu Yan
Yifu Yan

Reputation: 6106

To compare similarity between string, first step is usually cleaning the data with best knowledge you have:

Since many methods for calculating string distance will treat upper-case and lower-case letters as different letters, so first you should convert all characters to the same case. And you could do any other cleaning to help improve the accuracy.

library(dplyr)
companyName <- company$CompanyName %>%
    toupper() %>% # convert to upper case
    stringr::str_replace_all("\\s+"," ") %>% # convert any consecutive whitespaces to single space
    stringr::str_remove_all("\\.|,") # remove all comma or dot
> companyName
[1] "MERCK SHARP & DOHME CORPORATION"          "GILEAD SCIENCES INC"                      "BOEHRINGER INGELHEIM PHARMACEUTICALS INC"
[4] "ABBVIE INC"                               "JANSSEN SCIENTIFIC AFFAIRS LLC"           "BOEHRINGER INGELHEIM PHARMA GMBH & COKG" 
[7] "ASAHI INTECC CO LTD"                      "ASAHI INTECC USA INC"    

Calculate string distance:

distanceMatrix <- stringdist::stringdistmatrix(
    a = companyName,
    b = companyName,
    # You can pick the method that works best for your data. Also, manual inspection is needed. See ?stringdist 
    # I'm picking soundex for this example
    method = "soundex"
)

By using soundex method, if a cell is 0, it means the corresponding row and column are very close

> distanceMatrix
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,]    0    1    1    1    1    1    1    1
[2,]    1    0    1    1    1    1    1    1
[3,]    1    1    0    1    1    0    1    1
[4,]    1    1    1    0    1    1    1    1
[5,]    1    1    1    1    0    1    1    1
[6,]    1    1    0    1    1    0    1    1
[7,]    1    1    1    1    1    1    0    0
[8,]    1    1    1    1    1    1    0    0

This means, in the companyName vector, item 3 is close to item 6, and item 7 is close to item 8.

result <- which(distanceMatrix==0,arr.ind = TRUE) %>%
    as.data.frame() %>%
    dplyr::filter(col > row)
> result
  row col
1   3   6
2   7   8

> result %>% mutate_all(~companyName[.x])
                                       row                                     col
1 BOEHRINGER INGELHEIM PHARMACEUTICALS INC BOEHRINGER INGELHEIM PHARMA GMBH & COKG
2                      ASAHI INTECC CO LTD                    ASAHI INTECC USA INC

Please note that you can increase accuracy by cleaning the string or choosing different methods, parameters or threshold when calculating string distance. But it can never grantee 100% accuracy.

Finally, to count unique companies, we could do:

> length(companyName) - length(unique(result$row))
[1] 6

Upvotes: 4

Related Questions