NightDog
NightDog

Reputation: 91

If one column contains words of another column match

Let's say A and B are columns in the dataset, I want to develop a fuzzy match logic - if there is at least one word in column A that matches one word in column B except for the word "bank" and "of", we assign 1 in the new column, if there's 0 match, we assign 0. I want to do this in R.

A                          B
BANK OF AMERICA         CHASE BANK
BANK OF AMERICA         BANK OF AMERICA, N.A.
BANK OF HOPE            HOPE BANK
T.D BANK                CHASE BANK

expected output

A                         B                        C
BANK OF AMERICA         CHASE BANK                 0
BANK OF AMERICA         BANK OF AMERICA, N.A       1
BANK OF HOPE            HOPE BANK                  1
T.D. BANK               CHASE BANK                 0

Upvotes: 2

Views: 247

Answers (3)

ThomasIsCoding
ThomasIsCoding

Reputation: 101247

Maybe the following base R option could help

df$C <- +do.call(
  function(...) mapply(function(...) any(!intersect(...) %in% c("BANK","OF")),...),
  Map(function(x) strsplit(x,"[[:punct:][:blank:]]",perl = TRUE), df, USE.NAMES = FALSE)
)

which gives

> df
                A                     B C
1 BANK OF AMERICA            CHASE BANK 0
2 BANK OF AMERICA BANK OF AMERICA, N.A. 1
3    BANK OF HOPE             HOPE BANK 1
4        T.D BANK            CHASE BANK 0

Upvotes: 1

tamtam
tamtam

Reputation: 3671

Here is another options - with dplyrand stringr.

df <- data.frame(A = c(rep("BANK OF AMERICA", 2), "BANK OF HOPE", "T.D BANK"),
                 B = c("CHASE BANK", "BANK OF AMERICA, N.A.", "HOPE BANK", "CHASE BANK"),
                 stringsAsFactors = FALSE) 


df <- df %>% 
  mutate(C = str_remove_all(B, c("BANK|OF|,")), #remove stopwords
         C = str_trim(C), #remove whitespace from start/end
         C = str_replace_all(C, "  ", ""), #remove double whitespaces
         C = str_replace_all(C, " ", "|")) %>% #replace whitespace with |
  mutate(D = as.numeric(str_detect(A, C))) %>% 
  select(A, B, D)

                A                     B D
1 BANK OF AMERICA            CHASE BANK 0
2 BANK OF AMERICA BANK OF AMERICA, N.A. 1
3    BANK OF HOPE             HOPE BANK 1
4        T.D BANK            CHASE BANK 0

Upvotes: 2

jamoreiras
jamoreiras

Reputation: 333

I believe a combination of regex and apply works well here.

> df <- data.frame(A = c('BANK OF AMERICA', 'BANK OF AMERICA', 'BANK OF HOPE', 'T.D BANK'),
                 B = c('CHASE BANK', 'BANK OF AMERICA, N.A.', 'HOPE BANK', 'CHASE BANK'),
                 stringsAsFactors = FALSE)

> f <- function(x) {
  left <- strsplit(x[1], "(BANK OF\\s|\\s|,|\\sBANK)")[[1]]
  right <- strsplit(x[2],  "(BANK OF\\s|\\s|,|\\sBANK)")[[1]]
  ans <- left %in% right
  as.integer(all(ans[!(left %in% "")]))
}

> df$C <- apply(df, 1, f)
> df
                A                     B C
1 BANK OF AMERICA            CHASE BANK 0
2 BANK OF AMERICA BANK OF AMERICA, N.A. 1
3    BANK OF HOPE             HOPE BANK 1
4        T.D BANK            CHASE BANK 0

Upvotes: 0

Related Questions