Dr. Dre
Dr. Dre

Reputation: 41

Comparing two columns in R based on matching values in a common column

So, I have two datasets, such that all the columns in one dataset is present in the other, along with some extra columns. What I want to do is to create a new dataset of the differences between the entries in common columns, on the basis of matching a common identifier column, present in both the datasets. How can I do that in R? If it were a single column, I could have used the sqldf function as

sqldf("select a.v1 - b.v1 from ds1 a left join ds2 b on a.identifier=b.identifier")

But there are 900 common columns between both the datasets.

Upvotes: 1

Views: 64

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269431

Using S1 and S2 as an example (see Note at end) we find the common column names using intersect. In the example data in the Note at the end there is only one numeric common column that is not ID but the same code should work even if there are many.

Then using Filter extract the names of the numeric common columns. We have assumed that if a common column in S1 is numeric then it is also numeric in S2 so we only have to check S1. The Filter line could be omitted if we knew that all common columns were numeric.

Next ensure that the ID column is excluded using setdiff. If ID is non numeric the Filter line would have already removed it in which case we could omit the setdiff line.

Now construct the select clause. sprintf creates a character vector of the elements of the select clause and toString collapses it to a comma separated string giving the final select string. Finally run the SQL statement. Note that fn$sqldf turns on string interpolation in the SQL statement and $sel inserts the contents of the sel variable into the SQL string.

library (sqldf)

nms <- intersect(names(S1), names(S2))
nms <- names(Filter(is.numeric, S1[nms]))
nms <- setdiff(nms, "ID")

sel <- toString(sprintf("a.[%s] - b.[%s] as [%s]", nms, nms, nms))
fn$sqldf("select ID, $sel   
          from S1 a   
          left join S2 b using(ID)")
##   ID extra
## 1  1     0
## 2  2     0
## 3  3     0
## 4  4     0
## 5  5     0
## 6  6     0

Pipe

The nms<- lines above could alternately be written in terms of pipes:

nms <- names(S1) |>
  intersect(names(S2)) |>
  subset(S1, select = _) |>  
  Filter(f = is.numeric) |>
  names() |>
  setdiff("ID")

Note

The data frame sleep comes with R. S1 and S2 are used as an example.

S1 <- head(sleep)
S2 <- S1[-2]

S1
##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6

S2
##   extra ID
## 1   0.7  1
## 2  -1.6  2
## 3  -0.2  3
## 4  -1.2  4
## 5  -0.1  5
## 6   3.4  6

Upvotes: 1

langtang
langtang

Reputation: 24722

You can do this by simply joining the frames on identifier, and then subtracting the one frame from the other.

Here is an example of the approach using data.table

# load the library
library(data.table)

# do a left join on identifier
merged = setDT(ds2)[setDT(ds1), on="identifier"]

# get common column names, and remove "identifier" from that vector
cols = intersect(names(ds1), names(ds2))
cols = cols[cols!="identifier"]

# column-bind the identifier column with a substraction of the two sub-frames
cbind(
  merged[,.(identifier)],
  setnames(merged[,.SD, .SDcols = paste0("i.",cols)] - merged[,.SD, .SDcols = cols],paste0("diff_",cols))
)

Here is the same approach using dplyr:

library(dplyr)
merged = left_join(ds1,ds2, by="identifier")
cols = intersect(names(ds1), names(ds2))
cols = cols[cols!="identifier"]
bind_cols(
  select(merged,identifier),
  (select(merged, all_of(paste0(cols, ".x"))) - select(merged, all_of(paste0(cols, ".y")))) %>% 
    rename_with(~paste0("diff_",cols), everything())
)

Output (same under either approach):

    identifier    diff_v1    diff_v2     diff_v3
        <char>      <num>      <num>       <num>
 1:          O  0.5028498  0.7573174 -1.00630610
 2:          S -2.5631238 -0.7041228  1.33877932
 3:          N         NA         NA          NA
 4:          C         NA         NA          NA
 5:          J         NA         NA          NA
 6:          R         NA         NA          NA
 7:          K         NA         NA          NA
 8:          E         NA         NA          NA
 9:          X -0.1830764  0.2924459 -0.01860763
10:          Y         NA         NA          NA
11:          W         NA         NA          NA
12:          T -0.4912840 -2.8126285 -1.33661044
13:          I         NA         NA          NA
14:          L         NA         NA          NA
15:          U         NA         NA          NA
16:          M -0.3130889  1.1590316 -0.44551660
17:          P         NA         NA          NA
18:          H         NA         NA          NA
19:          B         NA         NA          NA
20:          G -2.2817049  2.4156583 -0.34393988
21:          Z         NA         NA          NA
22:          A -0.1654816 -0.8807393 -1.08534789
23:          F         NA         NA          NA
24:          V         NA         NA          NA
25:          D  1.4653655  0.2604109 -0.17733840
26:          Q         NA         NA          NA
    identifier    diff_v1    diff_v2     diff_v3

Input:

set.seed(123)
ds1 = data.frame(identifier = sample(LETTERS),v1 = rnorm(26),v2 = rnorm(26),v3 = rnorm(26))
ds2 = data.frame(identifier = sample(LETTERS,8),v1 = rnorm(8),v2 = rnorm(8),v3 = rnorm(8))

Upvotes: 1

Related Questions