Luker354
Luker354

Reputation: 669

Match strings between two data frame and add missing strings to strings which are not complete in R

I have two data frames, with 4 different columns in different orders. One data frame 1 has an ID column with strings having the complete and correct name. Then I have a data frame 2 with a ID column which is not complete (ID_not_complete). Data frame 1 has more rows and contains 100% all strings from data frame 2. I would like to add the missing strings to the ID_not_complete column in data frame 2. As it is in my example, a string in data frame 2 can have multiple matches in data frame 1, but only one matches the exact length of the string:

  1. rs1725 --> AX-42144793569__rs1725
  2. rs1725 --> AX-42179369__rs1725074

The first option should be the correct one.

Data frame 1

ID<-c("AX-35388475__rs16896864","AX-11425569__rs289621","AX-11102771__rs10261724","AX-42179369__rs1725074","AX-42144793569__rs1725","AX-42749369__rs264930","AX-32893019__rs6114382")
ID<-as.data.frame(ID)

Data frame 2

ID_not_complete<-c("rs16896864","rs289621","rs10261724","rs1725074","rs1725")
ID_not_complete <-as.data.frame(ID_not_complete)

The output data frame 2 should look like:

ID_complete<-c("AX-35388475__rs16896864","AX-11425569__rs289621","AX-11102771__rs10261724","AX-42179369__rs1725074","AX-42144793569__rs1725")
ID_complete <-as.data.frame(ID_complete)

I think I need to use grep. But I really don't know how to to it for each value in a column.

Upvotes: 0

Views: 231

Answers (2)

Chris Ruehlemann
Chris Ruehlemann

Reputation: 21400

EDIT:

Data:

ID <- c("AX-35388475__rs16896864","AX-11425569__rs289621","AX-11102771__rs10261724","AX-42179369__rs1725074","AX-42144793569__rs1725","AX-42749369__rs264930","AX-32893019__rs6114382")

df1 <- data.frame(
  ID = ID
)

ID_not_complete<-c("rs16896864","rs289621","rs10261724","rs1725074","rs1725")

df2 <- data.frame(
  ID_not_complete = ID_not_complete
)

Solution: First define patterns:

patt0 <- paste0(sub("([^_]+)__(.*)", "\\2", df1$ID), collapse = "|")
patt1 <- paste0(df2$ID_not_complete, collapse = "|")

Now transform df2$ID_not_complete:

df2$ID_not_complete <- paste0(sub("([^_]+)__(.*)", "\\1__", df1$ID[grepl(patt1, df1$ID)]),
                          grep(patt0, df2$ID_not_complete, value = T))

Result:

df2
          ID_not_complete
1 AX-35388475__rs16896864
2   AX-11425569__rs289621
3 AX-11102771__rs10261724
4  AX-42179369__rs1725074
5  AX-42144793569__rs1725

Upvotes: 1

r2evans
r2evans

Reputation: 160447

The fuzzyjoin package allows for joining on regular expressions (patterns).

A first (flawed) approach is:

fuzzyjoin::regex_inner_join(ID, ID_not_complete, by = c(ID="ID_not_complete"))
#                            ID ID_not_complete
# 1     AX-35388475__rs16896864      rs16896864
# 2       AX-11425569__rs289621        rs289621
# 3 AX-11102771__rs10261724          rs10261724
# 4      AX-42179369__rs1725074       rs1725074
# 5      AX-42179369__rs1725074          rs1725
# 6      AX-42144793569__rs1725          rs1725

where rs1725 matches both rs1725 and rs1725074 (matching leading characters). I'll infer that you don't mean for this to happen, so a quick fix using some additional boundary-like patterns (also correcting for your data having spaces):

ID_not_complete$ptn <- paste0("(^|[\\s_])", ID_not_complete$ID_not_complete, "([\\s_]|$)")
fuzzyjoin::regex_inner_join(ID, ID_not_complete, by = c(ID="ptn"))
#                            ID ID_not_complete                            ptn
# 1     AX-35388475__rs16896864      rs16896864 (^|[\\s_])rs16896864([\\s_]|$)
# 2       AX-11425569__rs289621        rs289621   (^|[\\s_])rs289621([\\s_]|$)
# 3 AX-11102771__rs10261724          rs10261724 (^|[\\s_])rs10261724([\\s_]|$)
# 4      AX-42179369__rs1725074       rs1725074  (^|[\\s_])rs1725074([\\s_]|$)
# 5      AX-42144793569__rs1725          rs1725     (^|[\\s_])rs1725([\\s_]|$)

(Side note: I originally wanted and intended to use the regex word-boundary \b in the pattern, but according to https://www.regular-expressions.info/shorthand.html, it considered [A-Za-z0-9_] to be "word characters", so the _ preceding the real IDs was not a boundary. So if other users have a similar problem that does not use underscores, then the (^|[\\s_]) can be replaced completely with \\b, similarly for the end-pattern.)


Edit

If all you need is to filter out from ID those that are not found otherwise, then perhaps just

paste0("_(", paste(ID_not_complete$ID_not_complete, collapse = "|"), ") *$")
# [1] "_(rs16896864|rs289621|rs10261724|rs1725074|rs1725) *$"
grepl(paste0("_(", paste(ID_not_complete$ID_not_complete, collapse = "|"), ") *$"), ID$ID)
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
ID[grepl(paste0("_(", paste(ID_not_complete$ID_not_complete, collapse = "|"), ") *$"), ID$ID),,drop=FALSE]
#                            ID
# 1     AX-35388475__rs16896864
# 2       AX-11425569__rs289621
# 3 AX-11102771__rs10261724    
# 4      AX-42179369__rs1725074
# 5      AX-42144793569__rs1725

or

gsub("\\s", "", gsub(".*_", "", ID$ID))
# [1] "rs16896864" "rs289621"   "rs10261724" "rs1725074"  "rs1725"     "rs264930"   "rs6114382" 
ID[ gsub("\\s", "", gsub(".*_", "", ID$ID)) %in% ID_not_complete$ID_not_complete,,drop=FALSE]
#                            ID
# 1     AX-35388475__rs16896864
# 2       AX-11425569__rs289621
# 3 AX-11102771__rs10261724    
# 4      AX-42179369__rs1725074
# 5      AX-42144793569__rs1725

Upvotes: 2

Related Questions