thiagoveloso
thiagoveloso

Reputation: 2763

R - join data tables by pairs of keys

I have the following data tables:

> dat1
   ID1  ID2 distance
1:   1 1117  3022.22
2:   1  386 16107.74
3:   1  920 17327.00
4:   1   91 24691.42
5:   1  227 29459.34

> dat2
    ID1 ID2 common_date_begin common_date_end
1:    2   1        2000-01-01      2006-12-03
2:    3   1        2000-05-01      2006-12-03
3:    3   2        2000-05-01      2006-12-03
4: 1117   1        2000-01-01      2006-12-03
5:    4   2        2000-01-01      2006-12-03

dat1 shows some meteorological stations (ID1), their nearest neighbor stations (ID2), and the distance between them. dat2 shows the common periods of data availability for each pair of stations. The problem is, some stations are ID1 in dat1 and are ID2 in dat2. For example, stations 1 and 1117.

I want to join dat1 and dat2 using ID1 and ID2 just as pairs, with no specified order. I am doing:

dat1[dat2, on=.(ID1, ID2), (cols) := mget(paste0("i.", cols))][,
                                                             diff_days := common_date_end - common_date_begin][]

  ID1  ID2 distance common_date_begin common_date_end diff_days
1:   1 1117  3022.22              <NA>            <NA>   NA days
2:   1  386 16107.74              <NA>            <NA>   NA days
3:   1  920 17327.00              <NA>            <NA>   NA days
4:   1   91 24691.42              <NA>            <NA>   NA days
5:   1  227 29459.34              <NA>            <NA>   NA days

which doesn't work, because 1 and 1117 are under distinct identifiers. However, the result I am looking for is:

  ID1  ID2 distance common_date_begin common_date_end diff_days
1:   1 1117  3022.22          2000-01-01      2006-12-03 2528 days
2:   1  386 16107.74              <NA>            <NA>   NA days
3:   1  920 17327.00              <NA>            <NA>   NA days
4:   1   91 24691.42              <NA>            <NA>   NA days
5:   1  227 29459.34              <NA>            <NA>   NA days

Is this achievable using data.table?

Here is the data to reproduce my problem:

library(data.table)
dat1 <- read.table(text="ID1  ID2 distance
1 1117 3022.22
1 386 16107.74
1 920 17327.00
1 91  24691.42
1 227 29459.34", header=T, stringsAsFactors=F)
dat1 <- data.table(dat1)

dat2 <- read.table(text="ID1  ID2 common_date_begin common_date_end
2    1        2000-01-01      2006-12-03
3    1        2000-05-01      2006-12-03
3    2        2000-05-01      2006-12-03
1117 1        2000-01-01      2006-12-03
4    2        2000-01-01      2006-12-03", header=T, stringsAsFactors=F)
dat2 <- data.table(dat2)

# convert character to dates
cols <- c("common_date_begin", "common_date_end")
dat2[, (cols) := lapply(.SD, as.Date), .SDcols = cols]

Upvotes: 1

Views: 49

Answers (1)

chinsoon12
chinsoon12

Reputation: 25223

You can sort the columns first before joining:

dat1[, paste0("ID", 1L:2L) := .(pmin(ID1, ID2), pmax(ID1, ID2))]
dat2[, paste0("ID", 1L:2L) := .(pmin(ID1, ID2), pmax(ID1, ID2))]

Upvotes: 3

Related Questions