Zoe
Zoe

Reputation: 55

Compare columns between two data frames R

I have two data frames:

c1 <- c("chr1:981994","chr1:1025751","chr2:6614300", "chr2:6784300")
c2 <- c("G/A","C/T","A/T", "T/G")

df1 <- data.frame(c1,c2)

a <- c("chr1:981994","chr1:1000000","chr2:6614300", "chr2:6784300")
b <- c("G/G","C/C","A/A", "T/T")
c <- c("G/G","C/C","A/T", "T/T")
d <- c("G/A","C/T","A/T", "T/G")
df2 <- data.frame(a, b, c, d)

I would like to compare the two data frames and if data in column 1 matches, then compare columns b, c and d against column c2 and if at least TWO columns or more from dataframe 2 (a, b, c and/or d) are different to c2 in dataframe 1 then output this row into a new data frame. So for this example the first and last row of data frame 2 would be outputted as both column 1 entries match and columns b and c are different to c2.

I am very new to R, I tried to look at the compare function but got a little overwhelmed. Would very much appreciate any help.

Upvotes: 3

Views: 10704

Answers (1)

mathematical.coffee
mathematical.coffee

Reputation: 56935

Thanks for the reproducible example. First, you can merge with merge. Have a look at ?merge for other configuration options - you can specify the column to merge on using by.x and by.y

df3 = merge(df1, df2, by.x='c1', by.y='a')
#             c1  c2   b   c   d
# 1  chr1:981994 G/A G/G G/G G/A
# 2 chr2:6614300 A/T A/A A/T A/T
# 3 chr2:6784300 T/G T/T T/T T/G

Note that the non-matching rows in df1 and df2 are left out here. Then you can just filter out the rows where c2 matches exactly one of columns b,c, d (then by definition it will not match the other two).

There are lots of ways to do this, e.g.

as.character(df3$c2) == df3[, c('b', 'c', 'd')]
#          b     c    d
# [1,] FALSE FALSE TRUE
# [2,] FALSE  TRUE TRUE
# [3,] FALSE FALSE TRUE

rows with only one TRUE are the ones you want.

df3[rowSums(as.character(df3$c2) == df3[, c('b', 'c', 'd')]) == 1, ]

Or you could simply loop over all the rows, or use something like apply:

apply(df3, 1, function (row) {
  sum(row['c2'] == row[c('b', 'c', 'd')]) == 1
})
# [1]  TRUE FALSE  TRUE
df3[.Last.value, ]

Upvotes: 3

Related Questions