Reputation: 489
I have this dataframe:
df <- data.frame(id_1=c('888046309', '888046309', '888046309', '888046309', '003046309', '465798132', '465798132', '465798132', '465798132', '465798132', '465798132', '465798132', '465798132'),
id_2=c('0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309', '0003046309'))
and I would like to create a column that would indicate where a part of id_2
is present in id_1
.
I tried this from this suggestion
i <- sapply(seq_along(df$id_2), function(i) grepl(df$id_2[i], df$id_1[i]))
df$flag <- c("No", "Yes")[i + 1L]
and flag had all NO values even though you can see that 46309 is present in both id_1
& id_2
.
I then tried this from this suggestion
df$flag_2 <- str_detect(df$id_1, df$id_2)
And I got FALSEs for flag_2
Finally I tried this, and like flag_2
, I got all FALSEs for flag_3
.
df <- df %>%
mutate(flag_3 = c('No', 'Yes')[1+str_detect(id_1, as.character(id_2))])
Can these suggestions be edited so that it can indicate if some part of id_2
is present in id_1
?
Upvotes: 1
Views: 75
Reputation: 72803
You could use adist
which basically calculates the cost to convert one string into another. Define a th
reshold how much replacement you want to allow and set a respective flag.
df$d <- t(apply(df[c('id_1', 'id_2')], 1, adist))[,2]
df
# id_1 id_2 d
# 1 888046309 0003046309 4
# 2 888046309 0003046309 4
# 3 888046309 0003046309 4
# 4 888046309 0003046309 4
# 5 003046309 0003046309 1
# 6 465798132 0003046309 9
# 7 465798132 0003046309 9
# 8 465798132 0003046309 9
# 9 465798132 0003046309 9
# 10 465798132 0003046309 9
# 11 465798132 0003046309 9
# 12 465798132 0003046309 9
# 13 465798132 0003046309 9
th <- nchar(df$id_2)[1] - 2
df$flag <- with(df, ifelse(d > th, 1, 0))
df
# id_1 id_2 d flag
# 1 888046309 0003046309 4 0
# 2 888046309 0003046309 4 0
# 3 888046309 0003046309 4 0
# 4 888046309 0003046309 4 0
# 5 003046309 0003046309 1 0
# 6 465798132 0003046309 9 1
# 7 465798132 0003046309 9 1
# 8 465798132 0003046309 9 1
# 9 465798132 0003046309 9 1
# 10 465798132 0003046309 9 1
# 11 465798132 0003046309 9 1
# 12 465798132 0003046309 9 1
# 13 465798132 0003046309 9 1
We can of course combine the two steps then it looks like this:
th <- 8 ## arbitrary
df <- within(df, flag <- t(apply(df[c('id_1', 'id_2')], 1, adist))[,2] < th)
df
# id_1 id_2 flag
# 1 888046309 0003046309 TRUE
# 2 888046309 0003046309 TRUE
# 3 888046309 0003046309 TRUE
# 4 888046309 0003046309 TRUE
# 5 003046309 0003046309 TRUE
# 6 465798132 0003046309 FALSE
# 7 465798132 0003046309 FALSE
# 8 465798132 0003046309 FALSE
# 9 465798132 0003046309 FALSE
# 10 465798132 0003046309 FALSE
# 11 465798132 0003046309 FALSE
# 12 465798132 0003046309 FALSE
# 13 465798132 0003046309 FALSE
Upvotes: 1
Reputation: 21908
I am not sure if this is what you have in mind, I tried it and I got some encouraging results, but there is still room for improvement. For this I erased 2 0
from the id_2
maybe it wasn't a good call but it facilitate fuzzy matching here.
df$flag <- sapply(1:nrow(df), function(x) agrepl(substr(df$id_2[x], 2, nchar(df$id_2[x])),
df$id_1[x], max.distance = 4))
df
id_1 id_2 flag
1 888046309 0003046309 TRUE
2 888046309 0003046309 TRUE
3 888046309 0003046309 TRUE
4 888046309 0003046309 TRUE
5 003046309 0003046309 TRUE
6 465798132 0003046309 FALSE
7 465798132 0003046309 FALSE
8 465798132 0003046309 FALSE
9 465798132 0003046309 FALSE
10 465798132 0003046309 FALSE
11 465798132 0003046309 FALSE
12 465798132 0003046309 FALSE
13 465798132 0003046309 FALSE
Upvotes: 1