Reputation: 669
I have two data frames, with 4 different columns in different orders. One data frame 1
has an ID
column with strings having the complete and correct name. Then I have a data frame 2
with a ID column which is not complete (ID_not_complete
). Data frame 1
has more rows and contains 100% all strings from data frame 2
. I would like to add the missing strings to the ID_not_complete
column in data frame 2
. As it is in my example, a string in data frame 2
can have multiple matches in data frame 1
, but only one matches the exact length of the string:
rs1725 --> AX-42144793569__rs1725
rs1725 --> AX-42179369__rs1725074
The first option should be the correct one.
Data frame 1
ID<-c("AX-35388475__rs16896864","AX-11425569__rs289621","AX-11102771__rs10261724","AX-42179369__rs1725074","AX-42144793569__rs1725","AX-42749369__rs264930","AX-32893019__rs6114382")
ID<-as.data.frame(ID)
Data frame 2
ID_not_complete<-c("rs16896864","rs289621","rs10261724","rs1725074","rs1725")
ID_not_complete <-as.data.frame(ID_not_complete)
The output data frame 2 should look like:
ID_complete<-c("AX-35388475__rs16896864","AX-11425569__rs289621","AX-11102771__rs10261724","AX-42179369__rs1725074","AX-42144793569__rs1725")
ID_complete <-as.data.frame(ID_complete)
I think I need to use grep
. But I really don't know how to to it for each value in a column.
Upvotes: 0
Views: 231
Reputation: 21400
EDIT:
Data:
ID <- c("AX-35388475__rs16896864","AX-11425569__rs289621","AX-11102771__rs10261724","AX-42179369__rs1725074","AX-42144793569__rs1725","AX-42749369__rs264930","AX-32893019__rs6114382")
df1 <- data.frame(
ID = ID
)
ID_not_complete<-c("rs16896864","rs289621","rs10261724","rs1725074","rs1725")
df2 <- data.frame(
ID_not_complete = ID_not_complete
)
Solution: First define patterns:
patt0 <- paste0(sub("([^_]+)__(.*)", "\\2", df1$ID), collapse = "|")
patt1 <- paste0(df2$ID_not_complete, collapse = "|")
Now transform df2$ID_not_complete
:
df2$ID_not_complete <- paste0(sub("([^_]+)__(.*)", "\\1__", df1$ID[grepl(patt1, df1$ID)]),
grep(patt0, df2$ID_not_complete, value = T))
Result:
df2
ID_not_complete
1 AX-35388475__rs16896864
2 AX-11425569__rs289621
3 AX-11102771__rs10261724
4 AX-42179369__rs1725074
5 AX-42144793569__rs1725
Upvotes: 1
Reputation: 160447
The fuzzyjoin
package allows for joining on regular expressions (patterns).
A first (flawed) approach is:
fuzzyjoin::regex_inner_join(ID, ID_not_complete, by = c(ID="ID_not_complete"))
# ID ID_not_complete
# 1 AX-35388475__rs16896864 rs16896864
# 2 AX-11425569__rs289621 rs289621
# 3 AX-11102771__rs10261724 rs10261724
# 4 AX-42179369__rs1725074 rs1725074
# 5 AX-42179369__rs1725074 rs1725
# 6 AX-42144793569__rs1725 rs1725
where rs1725
matches both rs1725
and rs1725074
(matching leading characters). I'll infer that you don't mean for this to happen, so a quick fix using some additional boundary-like patterns (also correcting for your data having spaces):
ID_not_complete$ptn <- paste0("(^|[\\s_])", ID_not_complete$ID_not_complete, "([\\s_]|$)")
fuzzyjoin::regex_inner_join(ID, ID_not_complete, by = c(ID="ptn"))
# ID ID_not_complete ptn
# 1 AX-35388475__rs16896864 rs16896864 (^|[\\s_])rs16896864([\\s_]|$)
# 2 AX-11425569__rs289621 rs289621 (^|[\\s_])rs289621([\\s_]|$)
# 3 AX-11102771__rs10261724 rs10261724 (^|[\\s_])rs10261724([\\s_]|$)
# 4 AX-42179369__rs1725074 rs1725074 (^|[\\s_])rs1725074([\\s_]|$)
# 5 AX-42144793569__rs1725 rs1725 (^|[\\s_])rs1725([\\s_]|$)
(Side note: I originally wanted and intended to use the regex word-boundary \b
in the pattern, but according to https://www.regular-expressions.info/shorthand.html, it considered [A-Za-z0-9_]
to be "word characters", so the _
preceding the real IDs was not a boundary. So if other users have a similar problem that does not use underscores, then the (^|[\\s_])
can be replaced completely with \\b
, similarly for the end-pattern.)
Edit
If all you need is to filter out from ID
those that are not found otherwise, then perhaps just
paste0("_(", paste(ID_not_complete$ID_not_complete, collapse = "|"), ") *$")
# [1] "_(rs16896864|rs289621|rs10261724|rs1725074|rs1725) *$"
grepl(paste0("_(", paste(ID_not_complete$ID_not_complete, collapse = "|"), ") *$"), ID$ID)
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE
ID[grepl(paste0("_(", paste(ID_not_complete$ID_not_complete, collapse = "|"), ") *$"), ID$ID),,drop=FALSE]
# ID
# 1 AX-35388475__rs16896864
# 2 AX-11425569__rs289621
# 3 AX-11102771__rs10261724
# 4 AX-42179369__rs1725074
# 5 AX-42144793569__rs1725
or
gsub("\\s", "", gsub(".*_", "", ID$ID))
# [1] "rs16896864" "rs289621" "rs10261724" "rs1725074" "rs1725" "rs264930" "rs6114382"
ID[ gsub("\\s", "", gsub(".*_", "", ID$ID)) %in% ID_not_complete$ID_not_complete,,drop=FALSE]
# ID
# 1 AX-35388475__rs16896864
# 2 AX-11425569__rs289621
# 3 AX-11102771__rs10261724
# 4 AX-42179369__rs1725074
# 5 AX-42144793569__rs1725
Upvotes: 2