nik

Reputation: 2584

Finding similar strings in each row of two different data frames

I would like to compare two data sets. One has many columns (df1; this example has two columns) and the other has a single column (df2).

First, I want to compare each row of the first column of df1 against every row of df2. If any part of the two strings matches, the row numbers in df1 and df2 should be written out.

For example, column 1 of df1 has parts in common with df2: Q9Y6Q9 appears in row 3 of df1 and in row 4 of df2, so the output is 3-4. The same applies to the other matches.

Upvotes: 0

Views: 1091

Answers (1)

Karsten W.

Reputation: 18500

It may help to normalize your data first, so that every identifier becomes one element of a named vector whose value is the row it came from. For instance, you could do:

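normalize <- function(x, delim) {
    # Strip parentheses so that only identifiers and delimiters remain
    x <- gsub(")", "", as.character(x), fixed=TRUE)
    x <- gsub("(", "", x, fixed=TRUE)
    # Split every row into its identifiers
    parts <- strsplit(x, delim)
    # Repeat each row number once per identifier found in that row;
    # lengths() stays in sync with strsplit() even for empty strings or trailing delimiters
    idx <- rep(seq_along(x), times=lengths(parts))
    # Result: row numbers named by identifier
    return(setNames(idx, unlist(parts)))
}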

This function can be applied to each column of df1 as well as to the lookup table df2:

s1 <- normalize(df1[,1], ";")
s2 <- normalize(df1[,2], ";")
lookup <- normalize(df2[,1], ",")
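To make the shape of the normalized data concrete, here is a minimal sketch of the same steps on a toy vector. Only Q9Y6Q9 comes from the question; the other identifiers and the vector x are made up for illustration.

```r
# Hypothetical input: two rows, identifiers separated by ";"
x <- c("P00001;Q9Y6Q9", "(O00002)")

# Remove parentheses, then split each row into its identifiers
x <- gsub("[()]", "", x)
parts <- strsplit(x, ";")

# One element per identifier: the name is the identifier, the value is its row
s <- setNames(rep(seq_along(x), times = lengths(parts)), unlist(parts))

names(s)   # "P00001" "Q9Y6Q9" "O00002"
unname(s)  # 1 1 2
```

Each identifier can now be looked up by name, and the value tells you which row it came from.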

With this normalized data, it is easy to generate the output you are looking for:

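process <- function(s) {
    # Look up every identifier of s in df2; identifiers missing from df2 give NA
    lookup_try <- lookup[names(s)]
    found <- which(!is.na(lookup_try))
    # df2 row numbers of the identifiers that were found
    pos <- lookup_try[found]
    # Pair each df1 row number with the matching df2 row number.
    # Change the last line to "return(as.character(pos))" to get only the result as in the comment
    return(paste(s[found], pos, sep="-"))
}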

process(s1)
# [1] "3-4" "4-1" "5-4"
process(s2)
# [1] "2-4"  "3-15" "7-16"

This output differs slightly from the expected output in the question, but that may be due to errors in the manual lookup there.
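Since the question's tables are not reproduced here, the following is a small self-contained sketch of the whole lookup on made-up data. The data frames, the column name ids, and all identifiers are hypothetical, and the helper is a compact restatement of the idea above rather than the exact function.

```r
# Hypothetical stand-ins for the question's data
df1 <- data.frame(sample_1 = c("A1;B2", "C3", "(D4);E5"), stringsAsFactors = FALSE)
df2 <- data.frame(ids = c("X9,C3", "E5", "Z0"), stringsAsFactors = FALSE)

# Compact version of the normalization step: identifier -> row number
normalize <- function(x, delim) {
    parts <- strsplit(gsub("[()]", "", as.character(x)), delim)
    setNames(rep(seq_along(x), times = lengths(parts)), unlist(parts))
}

s1 <- normalize(df1[, 1], ";")      # A1=1 B2=1 C3=2 D4=3 E5=3
lookup <- normalize(df2[, 1], ",")  # X9=1 C3=1 E5=2 Z0=3

# Indexing by name gives NA for identifiers that do not occur in df2
hits <- lookup[names(s1)]
found <- which(!is.na(hits))

# "row in df1"-"row in df2" for every shared identifier
out <- paste(s1[found], hits[found], sep = "-")
print(out)
# [1] "2-1" "3-2"
```

C3 sits in row 2 of df1 and row 1 of df2, and E5 in row 3 of df1 and row 2 of df2, which gives the two pairs.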

In order to iterate over all columns of df1, you could use lapply:

res <- lapply(colnames(df1), function(x) process(normalize(df1[,x], ";")))
names(res) <- colnames(df1)

Now, res is a list indexed by the column names of df1:

res[["sample_1"]]
# [1] "4" "1" "4"
# (this is the output of the "return(as.character(pos))" variant of process)
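As a variation on the loop over columns, the sketch below passes the lookup table as an argument instead of reading it from the global environment, and iterates over the data frame directly so the column names carry over. The function name match_rows and all data are assumptions made for illustration.

```r
# Hypothetical two-column df1 and one-column df2
df1 <- data.frame(sample_1 = c("A1;B2", "C3"),
                  sample_2 = c("E5", "(X9);Q7"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ids = c("X9,C3", "E5"), stringsAsFactors = FALSE)

# Identifier -> row number, as in the normalization step
normalize <- function(x, delim) {
    parts <- strsplit(gsub("[()]", "", as.character(x)), delim)
    setNames(rep(seq_along(x), times = lengths(parts)), unlist(parts))
}

# Same matching idea, but the lookup is an explicit parameter
match_rows <- function(s, lookup) {
    hits <- lookup[names(s)]
    found <- which(!is.na(hits))
    paste(s[found], hits[found], sep = "-")
}

lookup <- normalize(df2$ids, ",")

# lapply over a data frame visits one column at a time and keeps the names
res <- lapply(df1, function(col) match_rows(normalize(col, ";"), lookup))

res$sample_1  # "2-1"
res$sample_2  # "1-2" "2-1"
```

Passing lookup explicitly makes it easier to reuse the function with a different lookup table.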

Upvotes: 2
