Comparing two columns in R based on matching values in a common column

Question

So, I have two datasets, such that all the columns in one dataset is present in the other, along with some extra columns. What I want to do is to create a new dataset of the differences between the entries in common columns, on the basis of matching a common identifier column, present in both the datasets. How can I do that in R? If it were a single column, I could have used the sqldf function as

sqldf("select a.v1 - b.v1 from ds1 a left join ds2 b on a.identifier=b.identifier")

But there are 900 common columns between both the datasets.

langtang · Accepted Answer

You can do this by simply joining the frames on identifier, and then subtracting the one frame from the other.

Here is an example of the approach using data.table

# load the library
library(data.table)

# do a left join on identifier
merged = setDT(ds2)[setDT(ds1), on="identifier"]

# get common column names, and remove "identifier" from that vector
cols = intersect(names(ds1), names(ds2))
cols = cols[cols!="identifier"]

# column-bind the identifier column with a substraction of the two sub-frames
cbind(
  merged[,.(identifier)],
  setnames(merged[,.SD, .SDcols = paste0("i.",cols)] - merged[,.SD, .SDcols = cols],paste0("diff_",cols))
)

Here is the same approach using dplyr:

library(dplyr)
merged = left_join(ds1,ds2, by="identifier")
cols = intersect(names(ds1), names(ds2))
cols = cols[cols!="identifier"]
bind_cols(
  select(merged,identifier),
  (select(merged, all_of(paste0(cols, ".x"))) - select(merged, all_of(paste0(cols, ".y")))) %>% 
    rename_with(~paste0("diff_",cols), everything())
)

Output (same under either approach):

    identifier    diff_v1    diff_v2     diff_v3
                           
 1:          O  0.5028498  0.7573174 -1.00630610
 2:          S -2.5631238 -0.7041228  1.33877932
 3:          N         NA         NA          NA
 4:          C         NA         NA          NA
 5:          J         NA         NA          NA
 6:          R         NA         NA          NA
 7:          K         NA         NA          NA
 8:          E         NA         NA          NA
 9:          X -0.1830764  0.2924459 -0.01860763
10:          Y         NA         NA          NA
11:          W         NA         NA          NA
12:          T -0.4912840 -2.8126285 -1.33661044
13:          I         NA         NA          NA
14:          L         NA         NA          NA
15:          U         NA         NA          NA
16:          M -0.3130889  1.1590316 -0.44551660
17:          P         NA         NA          NA
18:          H         NA         NA          NA
19:          B         NA         NA          NA
20:          G -2.2817049  2.4156583 -0.34393988
21:          Z         NA         NA          NA
22:          A -0.1654816 -0.8807393 -1.08534789
23:          F         NA         NA          NA
24:          V         NA         NA          NA
25:          D  1.4653655  0.2604109 -0.17733840
26:          Q         NA         NA          NA
    identifier    diff_v1    diff_v2     diff_v3

Input:

set.seed(123)
ds1 = data.frame(identifier = sample(LETTERS),v1 = rnorm(26),v2 = rnorm(26),v3 = rnorm(26))
ds2 = data.frame(identifier = sample(LETTERS,8),v1 = rnorm(8),v2 = rnorm(8),v3 = rnorm(8))

Comparing two columns in R based on matching values in a common column

Answers (2)

Pipe

Note

Related Questions