Reputation: 41
So, I have two datasets, such that all the columns in one dataset is present in the other, along with some extra columns. What I want to do is to create a new dataset of the differences between the entries in common columns, on the basis of matching a common identifier column, present in both the datasets. How can I do that in R? If it were a single column, I could have used the sqldf function as
sqldf("select a.v1 - b.v1 from ds1 a left join ds2 b on a.identifier=b.identifier")
But there are 900 common columns between both the datasets.
Upvotes: 1
Views: 64
Reputation: 269431
Using S1 and S2 as an example (see Note at end) we find the common column names using intersect. In the example data in the Note at the end there is only one numeric common column that is not ID but the same code should work even if there are many.
Then using Filter extract the names of the numeric common columns. We have assumed that if a common column in S1 is numeric then it is also numeric in S2 so we only have to check S1. The Filter line could be omitted if we knew that all common columns were numeric.
Next ensure that the ID column is excluded using setdiff. If ID is non numeric the Filter line would have already removed it in which case we could omit the setdiff line.
Now construct the select clause. sprintf creates a character vector of the elements of the select clause and toString collapses it to a comma separated string giving the final select string. Finally run the SQL statement. Note that fn$sqldf turns on string interpolation in the SQL statement and $sel inserts the contents of the sel variable into the SQL string.
library (sqldf)
nms <- intersect(names(S1), names(S2))
nms <- names(Filter(is.numeric, S1[nms]))
nms <- setdiff(nms, "ID")
sel <- toString(sprintf("a.[%s] - b.[%s] as [%s]", nms, nms, nms))
fn$sqldf("select ID, $sel
from S1 a
left join S2 b using(ID)")
## ID extra
## 1 1 0
## 2 2 0
## 3 3 0
## 4 4 0
## 5 5 0
## 6 6 0
The nms<- lines above could alternately be written in terms of pipes:
nms <- names(S1) |>
intersect(names(S2)) |>
subset(S1, select = _) |>
Filter(f = is.numeric) |>
names() |>
setdiff("ID")
The data frame sleep comes with R. S1 and S2 are used as an example.
S1 <- head(sleep)
S2 <- S1[-2]
S1
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
S2
## extra ID
## 1 0.7 1
## 2 -1.6 2
## 3 -0.2 3
## 4 -1.2 4
## 5 -0.1 5
## 6 3.4 6
Upvotes: 1
Reputation: 24722
You can do this by simply joining the frames on identifier
, and then subtracting the one frame from the other.
Here is an example of the approach using data.table
# load the library
library(data.table)
# do a left join on identifier
merged = setDT(ds2)[setDT(ds1), on="identifier"]
# get common column names, and remove "identifier" from that vector
cols = intersect(names(ds1), names(ds2))
cols = cols[cols!="identifier"]
# column-bind the identifier column with a substraction of the two sub-frames
cbind(
merged[,.(identifier)],
setnames(merged[,.SD, .SDcols = paste0("i.",cols)] - merged[,.SD, .SDcols = cols],paste0("diff_",cols))
)
Here is the same approach using dplyr
:
library(dplyr)
merged = left_join(ds1,ds2, by="identifier")
cols = intersect(names(ds1), names(ds2))
cols = cols[cols!="identifier"]
bind_cols(
select(merged,identifier),
(select(merged, all_of(paste0(cols, ".x"))) - select(merged, all_of(paste0(cols, ".y")))) %>%
rename_with(~paste0("diff_",cols), everything())
)
Output (same under either approach):
identifier diff_v1 diff_v2 diff_v3
<char> <num> <num> <num>
1: O 0.5028498 0.7573174 -1.00630610
2: S -2.5631238 -0.7041228 1.33877932
3: N NA NA NA
4: C NA NA NA
5: J NA NA NA
6: R NA NA NA
7: K NA NA NA
8: E NA NA NA
9: X -0.1830764 0.2924459 -0.01860763
10: Y NA NA NA
11: W NA NA NA
12: T -0.4912840 -2.8126285 -1.33661044
13: I NA NA NA
14: L NA NA NA
15: U NA NA NA
16: M -0.3130889 1.1590316 -0.44551660
17: P NA NA NA
18: H NA NA NA
19: B NA NA NA
20: G -2.2817049 2.4156583 -0.34393988
21: Z NA NA NA
22: A -0.1654816 -0.8807393 -1.08534789
23: F NA NA NA
24: V NA NA NA
25: D 1.4653655 0.2604109 -0.17733840
26: Q NA NA NA
identifier diff_v1 diff_v2 diff_v3
Input:
set.seed(123)
ds1 = data.frame(identifier = sample(LETTERS),v1 = rnorm(26),v2 = rnorm(26),v3 = rnorm(26))
ds2 = data.frame(identifier = sample(LETTERS,8),v1 = rnorm(8),v2 = rnorm(8),v3 = rnorm(8))
Upvotes: 1