Match strings between two data frame and add missing strings to strings which are not complete in R

Question

I have two data frames, with 4 different columns in different orders. One data frame 1 has an ID column with strings having the complete and correct name. Then I have a data frame 2 with a ID column which is not complete (ID_not_complete). Data frame 1 has more rows and contains 100% all strings from data frame 2. I would like to add the missing strings to the ID_not_complete column in data frame 2. As it is in my example, a string in data frame 2 can have multiple matches in data frame 1, but only one matches the exact length of the string:

rs1725 --> AX-42144793569__rs1725
rs1725 --> AX-42179369__rs1725074

The first option should be the correct one.

Data frame 1

ID<-c("AX-35388475__rs16896864","AX-11425569__rs289621","AX-11102771__rs10261724","AX-42179369__rs1725074","AX-42144793569__rs1725","AX-42749369__rs264930","AX-32893019__rs6114382")
ID<-as.data.frame(ID)

Data frame 2

ID_not_complete<-c("rs16896864","rs289621","rs10261724","rs1725074","rs1725")
ID_not_complete <-as.data.frame(ID_not_complete)

The output data frame 2 should look like:

ID_complete<-c("AX-35388475__rs16896864","AX-11425569__rs289621","AX-11102771__rs10261724","AX-42179369__rs1725074","AX-42144793569__rs1725")
ID_complete <-as.data.frame(ID_complete)

I think I need to use grep. But I really don't know how to to it for each value in a column.

r2evans · Accepted Answer

The fuzzyjoin package allows for joining on regular expressions (patterns).

A first (flawed) approach is:

fuzzyjoin::regex_inner_join(ID, ID_not_complete, by = c(ID="ID_not_complete"))
#                            ID ID_not_complete
# 1     AX-35388475__rs16896864      rs16896864
# 2       AX-11425569__rs289621        rs289621
# 3 AX-11102771__rs10261724          rs10261724
# 4      AX-42179369__rs1725074       rs1725074
# 5      AX-42179369__rs1725074          rs1725
# 6      AX-42144793569__rs1725          rs1725

where rs1725 matches both rs1725 and rs1725074 (matching leading characters). I'll infer that you don't mean for this to happen, so a quick fix using some additional boundary-like patterns (also correcting for your data having spaces):

ID_not_complete$ptn <- paste0("(^|[\s_])", ID_not_complete$ID_not_complete, "([\s_]|$)")
fuzzyjoin::regex_inner_join(ID, ID_not_complete, by = c(ID="ptn"))
#                            ID ID_not_complete                            ptn
# 1     AX-35388475__rs16896864      rs16896864 (^|[\s_])rs16896864([\s_]|$)
# 2       AX-11425569__rs289621        rs289621   (^|[\s_])rs289621([\s_]|$)
# 3 AX-11102771__rs10261724          rs10261724 (^|[\s_])rs10261724([\s_]|$)
# 4      AX-42179369__rs1725074       rs1725074  (^|[\s_])rs1725074([\s_]|$)
# 5      AX-42144793569__rs1725          rs1725     (^|[\s_])rs1725([\s_]|$)

(Side note: I originally wanted and intended to use the regex word-boundary \b in the pattern, but according to https://www.regular-expressions.info/shorthand.html, it considered [A-Za-z0-9_] to be "word characters", so the _ preceding the real IDs was not a boundary. So if other users have a similar problem that does not use underscores, then the (^|[\s_]) can be replaced completely with \b, similarly for the end-pattern.)

Edit

If all you need is to filter out from ID those that are not found otherwise, then perhaps just

paste0("_(", paste(ID_not_complete$ID_not_complete, collapse = "|"), ") *$")
# [1] "_(rs16896864|rs289621|rs10261724|rs1725074|rs1725) *$"
grepl(paste0("_(", paste(ID_not_complete$ID_not_complete, collapse = "|"), ") *$"), ID$ID)
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
ID[grepl(paste0("_(", paste(ID_not_complete$ID_not_complete, collapse = "|"), ") *$"), ID$ID),,drop=FALSE]
#                            ID
# 1     AX-35388475__rs16896864
# 2       AX-11425569__rs289621
# 3 AX-11102771__rs10261724    
# 4      AX-42179369__rs1725074
# 5      AX-42144793569__rs1725

or

gsub("\s", "", gsub(".*_", "", ID$ID))
# [1] "rs16896864" "rs289621"   "rs10261724" "rs1725074"  "rs1725"     "rs264930"   "rs6114382" 
ID[ gsub("\s", "", gsub(".*_", "", ID$ID)) %in% ID_not_complete$ID_not_complete,,drop=FALSE]
#                            ID
# 1     AX-35388475__rs16896864
# 2       AX-11425569__rs289621
# 3 AX-11102771__rs10261724    
# 4      AX-42179369__rs1725074
# 5      AX-42144793569__rs1725

Match strings between two data frame and add missing strings to strings which are not complete in R

Answers (2)

Related Questions