NielsJo
NielsJo

Reputation: 45

Merge two dataframes by time and variable ID with NAs

Im currently working with some currencies in R and want to merge or override two datasets to create one.

For all the currencies, I have data from 1980 to 2017 in DF1. For 16 of them I also have data starting from different places between 1970 and 1975 to 2017 in DF2. What I want to do is to put the 1970-1980 part from DF2 on top of DF1. I guess I would achieve the same result if I managed to merge DF1 into DF2 for the cells where they both have values (so DF1 overrides DF2). However, the starting dates are not exactly equal for all currencies, so I can't just hardcode it.

Here is an example to show you what it looks like: Dateis the time variable (monthly data). DF1 would be corresponding to the 1980-2017 data and DF2 would correspond to the 1970-2017 data. My objective is to override the values from DF1 into DF2 based on both column and row ID. DF3 is the desired output, without the NAs from DF1, just the values from DF2 instead.

set.seed(1234)
DF1=data.frame(matrix(data=c(c(4:9),rnorm(30)),6,6))
set.seed(4321)
DF2=data.frame(matrix(data=c(c(1:12),rnorm(36)),12,4))
names(DF1)=c("Date","Currency1","Currency3","Currency6","Currency7","Currency8")
names(DF2)=c("Date","Currency1","Currency2","Currency3")
DF1$Currency3[1:2]=NA
DF1$Currency1[4:5]=NA

> DF1
  Date  Currency1  Currency3   Currency6  Currency7  Currency8
1    4 -1.2070657         NA -0.77625389 -0.8371717 -0.6937202
2    5  0.2774292         NA  0.06445882  2.4158352 -1.4482049
3    6  1.0844412 -0.5644520  0.95949406  0.1340882  0.5747557
4    7         NA -0.8900378 -0.11028549 -0.4906859 -1.0236557
5    8         NA -0.4771927 -0.51100951 -0.4405479 -0.0151383
6    9  0.5060559 -0.9983864 -0.91119542  0.4595894 -0.9359486

> DF2
   Date   Currency1    Currency2   Currency3
1     1 -0.42675738 -1.260985237 -0.09920208
2     2 -0.22361182  1.139464085 -0.23803425
3     3  0.71760679 -1.221781923  0.04778266
4     4  0.84144567  1.573315888  0.29651274
5     5 -0.12835727  0.073477874 -0.83380992
6     6  1.60934721 -1.175115087 -1.37397000
7     7 -0.29716745 -1.588261899  0.14027895
8     8  0.19600465 -0.747380729  0.66212596
9     9  1.24074620  0.483521864  1.13103967
10   10 -0.71869815 -0.003025539 -0.47511202
11   11 -0.06723632 -0.008930402  0.85241411
12   12  0.34436710  0.593357619 -0.75151885

I got this answer from a user, but I encountered a new issue, some of the data in DF1 has NAs in it, which this code overrides into DF2.

library(data.table)
DF3 <- copy(DF2)
nm1 <- names(DF1)[-1]
setDT(DF3)[DF1, (nm1) := mget(paste0("i.", nm1)), on = .(Date)]

    > DF3
    Date  Currency1  Currency2    Currency3   Currency6  Currency7  Currency8
 1:    1  1.1022975 -0.8553646 -0.162309524          NA         NA         NA
 2:    2 -0.4755931 -0.2806230  0.563055819          NA         NA         NA
 3:    3 -0.7094400 -0.9943401  1.647817473          NA         NA         NA
 4:    4 -1.2070657 -0.9685143           NA -0.77625389 -0.8371717 -0.6937202
 5:    5  0.2774292 -1.1073182           NA  0.06445882  2.4158352 -1.4482049
 6:    6  1.0844412 -1.2519859 -0.564451999  0.95949406  0.1340882  0.5747557
 7:    7         NA -0.5238281 -0.890037829 -0.11028549 -0.4906859 -1.0236557
 8:    8         NA -0.4968500 -0.477192700 -0.51100951 -0.4405479 -0.0151383
 9:    9  0.5060559 -1.8060313 -0.998386445 -0.91119542  0.4595894 -0.9359486
10:   10 -0.4658975 -0.5820759 -0.669633580          NA         NA         NA
11:   11  1.4494963 -1.1088896 -0.007604756          NA         NA         NA
12:   12 -1.0686427 -1.0149620  1.777084448          NA         NA         NA

Upvotes: 1

Views: 377

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 270195

1) sqldf Perform a left join of DF1 and DF2 using coalesce to pick out the first non-missing among its arguments:

library(sqldf)
sqldf("select Date,
              coalesce(DF1.Currency1, DF2.Currency1) Currency1,
              DF2.Currency2,
              coalesce(DF1.Currency3, DF2.Currency3) Currency3
       from DF2 left join DF1 using (Date)")

giving:

   Date   Currency1   Currency2  Currency3
1     1  1.12493092 -0.15579551  1.1000254
2     2 -0.04493361 -1.47075238  0.7631757
3     3 -0.01619026 -0.47815006 -0.1645236
4     4 -0.83562860  0.41794156 -0.2533617
5     5  1.59528080  1.35867955  0.6969634
6     6  0.32950780 -0.10278773  1.5117812
7     7 -0.82046840  0.38767161  0.3898432
8     8  0.48742910 -0.05380504 -0.6212406
9     9  0.73832470 -1.37705956 -2.2146999
10   10 -1.98935170 -0.41499456  0.7685329
11   11  0.61982575 -0.39428995 -0.1123462
12   12 -0.05612874 -0.05931340  0.8811077

The names could be paramterized like this:

Date <- names(DF2)[1]
Currency1 <- names(DF2)[2]
Currency2 <- names(DF2)[3]
Currency3 <- names(DF2)[4]

fn$sqldf("select [$Date],
              coalesce(DF1.[$Currency1], DF2.[$Currency1]) [$Currency1],
              DF2.[$Currency2],
              coalesce(DF1.[$Currency3], DF2.[$Currency3]) [$Currency3]
       from DF2 left join DF1 using ([$Date])")

The square brackets are only needed in case the arbitrary names contain special characters. If that is known not to be the case then the square brackets could be omitted.

Another way to parameterize is to rename the columns of the data frames to standard names, carry out the computation using the standard names and then restore the original names. Here DF1 and DF2 have arbitrary column names whereas DF1x and DF2x have the standard names used above. This approach could work more generally (e.g. it would also work with the dplyr code shown in (2) ).

DF1x <- setNames(DF1, c("Date", "Currency1", "Currency2", "Currency3")
DF2x <- setNames(DF2, c("Date", "Currency1", "Currency3"))

s <- sqldf("select Date,
              coalesce(DF1x.Currency1, DF2x.Currency1) Currency1,
              DF2x.Currency2,
              coalesce(DF1x.Currency3, DF2x.Currency3) Currency3
       from DF2x left join DF1x using (Date)")

names(s) <- names(DF2)

2) dplyr dplyr can do it in essentially the same manner and even has a coalesce function:

library(dplyr)
DF2 %>% 
     left_join(DF1, by = "Date") %>%
     transmute(Date,
               Currency1 = coalesce(Currency1.y, Currency1.x),
               Currency2,
               Currency3 = coalesce(Currency3.y, Currency3.x))

giving:

   Date   Currency1   Currency2  Currency3
1     1  1.12493092 -0.15579551  1.1000254
2     2 -0.04493361 -1.47075238  0.7631757
3     3 -0.01619026 -0.47815006 -0.1645236
4     4 -0.83562860  0.41794156 -0.2533617
5     5  1.59528080  1.35867955  0.6969634
6     6  0.32950780 -0.10278773  1.5117812
7     7 -0.82046840  0.38767161  0.3898432
8     8  0.48742910 -0.05380504 -0.6212406
9     9  0.73832470 -1.37705956 -2.2146999
10   10 -1.98935170 -0.41499456  0.7685329
11   11  0.61982575 -0.39428995 -0.1123462
12   12 -0.05612874 -0.05931340  0.8811077

Note: The original code defining the inputs in the question did not use set.seed making it not reproducible so we used the following which corresponds to the inputs originally shown in the question. (Since then the question has been fixed to add set.seed.)

DF1 <-
structure(list(Date = 4:9, Currency1 = c(-0.8356286, 1.5952808, 
0.3295078, -0.8204684, 0.4874291, 0.7383247), Currency3 = c(NA, 
NA, 1.5117812, 0.3898432, -0.6212406, -2.2146999)), .Names = c("Date", 
"Currency1", "Currency3"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))

DF2 <-
structure(list(Date = 1:12, Currency1 = c(1.12493092, -0.04493361, 
-0.01619026, 0.94383621, 0.8212212, 0.59390132, 0.91897737, 0.7821363, 
0.07456498, -1.9893517, 0.61982575, -0.05612874), Currency2 = c(-0.15579551, 
-1.47075238, -0.47815006, 0.41794156, 1.35867955, -0.10278773, 
0.38767161, -0.05380504, -1.37705956, -0.41499456, -0.39428995, 
-0.0593134), Currency3 = c(1.1000254, 0.7631757, -0.1645236, 
-0.2533617, 0.6969634, 0.5566632, -0.6887557, -0.7074952, 0.364582, 
0.7685329, -0.1123462, 0.8811077)), .Names = c("Date", "Currency1", 
"Currency2", "Currency3"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

Upvotes: 1

Jaap
Jaap

Reputation: 83275

To replace only the values in DF3 that have non-missing counterparts in DF1 you could use ifelse.

Using:

DF3 <- copy(DF2)
nm1 <- names(DF1)[-1]
nm2 <- names(DF2)
setDT(DF3)[DF1, (nm1) := {n <- seq_along(nm1);
                          lapply(n, function(i) ifelse(is.na(get(paste0("i.", nm1[i]))) & nm1[i] %in% nm2, 
                                                       get(paste0("x.", nm1[i])), 
                                                       get(paste0("i.", nm1[i]))))},
           on = .(Date)]

you get:

> DF3
    Date   Currency1    Currency2   Currency3   Currency6  Currency7  Currency8
 1:    1 -0.42675738 -1.260985237 -0.09920208          NA         NA         NA
 2:    2 -0.22361182  1.139464085 -0.23803425          NA         NA         NA
 3:    3  0.71760679 -1.221781923  0.04778266          NA         NA         NA
 4:    4 -1.20706575  1.573315888  0.29651274 -0.77625389 -0.8371717 -0.6937202
 5:    5  0.27742924  0.073477874 -0.83380992  0.06445882  2.4158352 -1.4482049
 6:    6  1.08444118 -1.175115087 -0.56445200  0.95949406  0.1340882  0.5747557
 7:    7 -0.29716745 -1.588261899 -0.89003783 -0.11028549 -0.4906859 -1.0236557
 8:    8  0.19600465 -0.747380729 -0.47719270 -0.51100951 -0.4405479 -0.0151383
 9:    9  0.50605589  0.483521864 -0.99838644 -0.91119542         NA -0.9359486
10:   10 -0.71869815 -0.003025539 -0.47511202          NA         NA         NA
11:   11 -0.06723632 -0.008930402  0.85241411          NA         NA         NA
12:   12  0.34436710  0.593357619 -0.75151885          NA         NA         NA

Used data:

set.seed(1234)
DF1=data.frame(matrix(data=c(c(4:9),rnorm(30)),6,6))
set.seed(4321)
DF2=data.frame(matrix(data=c(c(1:12),rnorm(36)),12,4))
names(DF1)=c("Date","Currency1","Currency3","Currency6","Currency7","Currency8")
names(DF2)=c("Date","Currency1","Currency2","Currency3")
DF1$Currency3[1:2]=NA
DF1$Currency1[4:5]=NA
DF1$Currency7[nrow(DF1)]=NA

Upvotes: 1

Related Questions