tangerine7199
tangerine7199

Reputation: 489

Does part of one column exist in another? R

I have this dataframe:

df <- data.frame(id_1=c('888046309', '888046309', '888046309', '888046309', '003046309', '465798132', '465798132', '465798132', '465798132', '465798132', '465798132', '465798132', '465798132'), 
                 id_2=c('0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309'))

and I would like to create a column that would indicate where a part of id_2 is present in id_1.

I tried this from this suggestion

i <- sapply(seq_along(df$id_2), function(i) grepl(df$id_2[i], df$id_1[i]))

df$flag <- c("No", "Yes")[i + 1L]

and flag had all NO values even though you can see that 46309 is present in both id_1 & id_2.

I then tried this from this suggestion

df$flag_2 <- str_detect(df$id_1, df$id_2)

And I got FALSEs for flag_2

Finally I tried this, and like flag_2, I got all FALSEs for flag_3.

df <- df %>% 
    mutate(flag_3 = c('No', 'Yes')[1+str_detect(id_1, as.character(id_2))])

Can these suggestions be edited so that it can indicate if some part of id_2 is present in id_1?

Upvotes: 1

Views: 75

Answers (2)

jay.sf
jay.sf

Reputation: 72803

You could use adist which basically calculates the cost to convert one string into another. Define a threshold how much replacement you want to allow and set a respective flag.

df$d <- t(apply(df[c('id_1', 'id_2')], 1, adist))[,2]
df
#         id_1       id_2 d
# 1  888046309 0003046309 4
# 2  888046309 0003046309 4
# 3  888046309 0003046309 4
# 4  888046309 0003046309 4
# 5  003046309 0003046309 1
# 6  465798132 0003046309 9
# 7  465798132 0003046309 9
# 8  465798132 0003046309 9
# 9  465798132 0003046309 9
# 10 465798132 0003046309 9
# 11 465798132 0003046309 9
# 12 465798132 0003046309 9
# 13 465798132 0003046309 9

th <- nchar(df$id_2)[1] - 2
df$flag <- with(df, ifelse(d > th, 1, 0))
df
#         id_1       id_2 d flag
# 1  888046309 0003046309 4    0
# 2  888046309 0003046309 4    0
# 3  888046309 0003046309 4    0
# 4  888046309 0003046309 4    0
# 5  003046309 0003046309 1    0
# 6  465798132 0003046309 9    1
# 7  465798132 0003046309 9    1
# 8  465798132 0003046309 9    1
# 9  465798132 0003046309 9    1
# 10 465798132 0003046309 9    1
# 11 465798132 0003046309 9    1
# 12 465798132 0003046309 9    1
# 13 465798132 0003046309 9    1

Combined

We can of course combine the two steps then it looks like this:

th <- 8  ## arbitrary
df <- within(df, flag <- t(apply(df[c('id_1', 'id_2')], 1, adist))[,2] < th)
df
#         id_1       id_2  flag
# 1  888046309 0003046309  TRUE
# 2  888046309 0003046309  TRUE
# 3  888046309 0003046309  TRUE
# 4  888046309 0003046309  TRUE
# 5  003046309 0003046309  TRUE
# 6  465798132 0003046309 FALSE
# 7  465798132 0003046309 FALSE
# 8  465798132 0003046309 FALSE
# 9  465798132 0003046309 FALSE
# 10 465798132 0003046309 FALSE
# 11 465798132 0003046309 FALSE
# 12 465798132 0003046309 FALSE
# 13 465798132 0003046309 FALSE

Upvotes: 1

Anoushiravan R
Anoushiravan R

Reputation: 21908

I am not sure if this is what you have in mind, I tried it and I got some encouraging results, but there is still room for improvement. For this I erased 2 0 from the id_2 maybe it wasn't a good call but it facilitate fuzzy matching here.

df$flag <- sapply(1:nrow(df), function(x) agrepl(substr(df$id_2[x], 2, nchar(df$id_2[x])), 
                                                 df$id_1[x], max.distance = 4))


df

        id_1       id_2  flag
1  888046309 0003046309  TRUE
2  888046309 0003046309  TRUE
3  888046309 0003046309  TRUE
4  888046309 0003046309  TRUE
5  003046309 0003046309  TRUE
6  465798132 0003046309 FALSE
7  465798132 0003046309 FALSE
8  465798132 0003046309 FALSE
9  465798132 0003046309 FALSE
10 465798132 0003046309 FALSE
11 465798132 0003046309 FALSE
12 465798132 0003046309 FALSE
13 465798132 0003046309 FALSE

Upvotes: 1

Related Questions