Reputation: 91
Let's say A
and B
are columns in the dataset, I want to develop a fuzzy match logic - if there is at least one word in column A that matches one word in column B except for the word "bank" and "of", we assign 1
in the new column, if there's 0 match, we assign 0
. I want to do this in R.
A B
BANK OF AMERICA CHASE BANK
BANK OF AMERICA BANK OF AMERICA, N.A.
BANK OF HOPE HOPE BANK
T.D BANK CHASE BANK
expected output
A B C
BANK OF AMERICA CHASE BANK 0
BANK OF AMERICA BANK OF AMERICA, N.A 1
BANK OF HOPE HOPE BANK 1
T.D. BANK CHASE BANK 0
Upvotes: 2
Views: 247
Reputation: 101247
Maybe the following base R option could help
df$C <- +do.call(
function(...) mapply(function(...) any(!intersect(...) %in% c("BANK","OF")),...),
Map(function(x) strsplit(x,"[[:punct:][:blank:]]",perl = TRUE), df, USE.NAMES = FALSE)
)
which gives
> df
A B C
1 BANK OF AMERICA CHASE BANK 0
2 BANK OF AMERICA BANK OF AMERICA, N.A. 1
3 BANK OF HOPE HOPE BANK 1
4 T.D BANK CHASE BANK 0
Upvotes: 1
Reputation: 3671
Here is another options - with dplyr
and stringr
.
df <- data.frame(A = c(rep("BANK OF AMERICA", 2), "BANK OF HOPE", "T.D BANK"),
B = c("CHASE BANK", "BANK OF AMERICA, N.A.", "HOPE BANK", "CHASE BANK"),
stringsAsFactors = FALSE)
df <- df %>%
mutate(C = str_remove_all(B, c("BANK|OF|,")), #remove stopwords
C = str_trim(C), #remove whitespace from start/end
C = str_replace_all(C, " ", ""), #remove double whitespaces
C = str_replace_all(C, " ", "|")) %>% #replace whitespace with |
mutate(D = as.numeric(str_detect(A, C))) %>%
select(A, B, D)
A B D
1 BANK OF AMERICA CHASE BANK 0
2 BANK OF AMERICA BANK OF AMERICA, N.A. 1
3 BANK OF HOPE HOPE BANK 1
4 T.D BANK CHASE BANK 0
Upvotes: 2
Reputation: 333
I believe a combination of regex and apply works well here.
> df <- data.frame(A = c('BANK OF AMERICA', 'BANK OF AMERICA', 'BANK OF HOPE', 'T.D BANK'),
B = c('CHASE BANK', 'BANK OF AMERICA, N.A.', 'HOPE BANK', 'CHASE BANK'),
stringsAsFactors = FALSE)
> f <- function(x) {
left <- strsplit(x[1], "(BANK OF\\s|\\s|,|\\sBANK)")[[1]]
right <- strsplit(x[2], "(BANK OF\\s|\\s|,|\\sBANK)")[[1]]
ans <- left %in% right
as.integer(all(ans[!(left %in% "")]))
}
> df$C <- apply(df, 1, f)
> df
A B C
1 BANK OF AMERICA CHASE BANK 0
2 BANK OF AMERICA BANK OF AMERICA, N.A. 1
3 BANK OF HOPE HOPE BANK 1
4 T.D BANK CHASE BANK 0
Upvotes: 0