Reputation: 45
Im currently working with some currencies in R and want to merge or override two datasets to create one.
For all the currencies, I have data from 1980 to 2017 in DF1
. For 16 of them I also have data starting from different places between 1970 and 1975 to 2017 in DF2
. What I want to do is to put the 1970-1980 part from DF2
on top of DF1
. I guess I would achieve the same result if I managed to merge DF1
into DF2
for the cells where they both have values (so DF1
overrides DF2
). However, the starting dates are not exactly equal for all currencies, so I can't just hardcode it.
Here is an example to show you what it looks like:
Date
is the time variable (monthly data). DF1
would be corresponding to the 1980-2017 data and DF2
would correspond to the 1970-2017 data. My objective is to override the values from DF1
into DF2
based on both column and row ID. DF3
is the desired output, without the NAs from DF1, just the values from DF2 instead.
set.seed(1234)
DF1=data.frame(matrix(data=c(c(4:9),rnorm(30)),6,6))
set.seed(4321)
DF2=data.frame(matrix(data=c(c(1:12),rnorm(36)),12,4))
names(DF1)=c("Date","Currency1","Currency3","Currency6","Currency7","Currency8")
names(DF2)=c("Date","Currency1","Currency2","Currency3")
DF1$Currency3[1:2]=NA
DF1$Currency1[4:5]=NA
> DF1
Date Currency1 Currency3 Currency6 Currency7 Currency8
1 4 -1.2070657 NA -0.77625389 -0.8371717 -0.6937202
2 5 0.2774292 NA 0.06445882 2.4158352 -1.4482049
3 6 1.0844412 -0.5644520 0.95949406 0.1340882 0.5747557
4 7 NA -0.8900378 -0.11028549 -0.4906859 -1.0236557
5 8 NA -0.4771927 -0.51100951 -0.4405479 -0.0151383
6 9 0.5060559 -0.9983864 -0.91119542 0.4595894 -0.9359486
> DF2
Date Currency1 Currency2 Currency3
1 1 -0.42675738 -1.260985237 -0.09920208
2 2 -0.22361182 1.139464085 -0.23803425
3 3 0.71760679 -1.221781923 0.04778266
4 4 0.84144567 1.573315888 0.29651274
5 5 -0.12835727 0.073477874 -0.83380992
6 6 1.60934721 -1.175115087 -1.37397000
7 7 -0.29716745 -1.588261899 0.14027895
8 8 0.19600465 -0.747380729 0.66212596
9 9 1.24074620 0.483521864 1.13103967
10 10 -0.71869815 -0.003025539 -0.47511202
11 11 -0.06723632 -0.008930402 0.85241411
12 12 0.34436710 0.593357619 -0.75151885
I got this answer from a user, but I encountered a new issue, some of the data in DF1
has NA
s in it, which this code overrides into DF2
.
library(data.table)
DF3 <- copy(DF2)
nm1 <- names(DF1)[-1]
setDT(DF3)[DF1, (nm1) := mget(paste0("i.", nm1)), on = .(Date)]
> DF3
Date Currency1 Currency2 Currency3 Currency6 Currency7 Currency8
1: 1 1.1022975 -0.8553646 -0.162309524 NA NA NA
2: 2 -0.4755931 -0.2806230 0.563055819 NA NA NA
3: 3 -0.7094400 -0.9943401 1.647817473 NA NA NA
4: 4 -1.2070657 -0.9685143 NA -0.77625389 -0.8371717 -0.6937202
5: 5 0.2774292 -1.1073182 NA 0.06445882 2.4158352 -1.4482049
6: 6 1.0844412 -1.2519859 -0.564451999 0.95949406 0.1340882 0.5747557
7: 7 NA -0.5238281 -0.890037829 -0.11028549 -0.4906859 -1.0236557
8: 8 NA -0.4968500 -0.477192700 -0.51100951 -0.4405479 -0.0151383
9: 9 0.5060559 -1.8060313 -0.998386445 -0.91119542 0.4595894 -0.9359486
10: 10 -0.4658975 -0.5820759 -0.669633580 NA NA NA
11: 11 1.4494963 -1.1088896 -0.007604756 NA NA NA
12: 12 -1.0686427 -1.0149620 1.777084448 NA NA NA
Upvotes: 1
Views: 377
Reputation: 270195
1) sqldf Perform a left join of DF1
and DF2
using coalesce
to pick out the first non-missing among its arguments:
library(sqldf)
sqldf("select Date,
coalesce(DF1.Currency1, DF2.Currency1) Currency1,
DF2.Currency2,
coalesce(DF1.Currency3, DF2.Currency3) Currency3
from DF2 left join DF1 using (Date)")
giving:
Date Currency1 Currency2 Currency3
1 1 1.12493092 -0.15579551 1.1000254
2 2 -0.04493361 -1.47075238 0.7631757
3 3 -0.01619026 -0.47815006 -0.1645236
4 4 -0.83562860 0.41794156 -0.2533617
5 5 1.59528080 1.35867955 0.6969634
6 6 0.32950780 -0.10278773 1.5117812
7 7 -0.82046840 0.38767161 0.3898432
8 8 0.48742910 -0.05380504 -0.6212406
9 9 0.73832470 -1.37705956 -2.2146999
10 10 -1.98935170 -0.41499456 0.7685329
11 11 0.61982575 -0.39428995 -0.1123462
12 12 -0.05612874 -0.05931340 0.8811077
The names could be paramterized like this:
Date <- names(DF2)[1]
Currency1 <- names(DF2)[2]
Currency2 <- names(DF2)[3]
Currency3 <- names(DF2)[4]
fn$sqldf("select [$Date],
coalesce(DF1.[$Currency1], DF2.[$Currency1]) [$Currency1],
DF2.[$Currency2],
coalesce(DF1.[$Currency3], DF2.[$Currency3]) [$Currency3]
from DF2 left join DF1 using ([$Date])")
The square brackets are only needed in case the arbitrary names contain special characters. If that is known not to be the case then the square brackets could be omitted.
Another way to parameterize is to rename the columns of the data frames to standard names, carry out the computation using the standard names and then restore the original names. Here DF1
and DF2
have arbitrary column names whereas DF1x
and DF2x
have the standard names used above. This approach could work more generally (e.g. it would also work with the dplyr code shown in (2) ).
DF1x <- setNames(DF1, c("Date", "Currency1", "Currency2", "Currency3")
DF2x <- setNames(DF2, c("Date", "Currency1", "Currency3"))
s <- sqldf("select Date,
coalesce(DF1x.Currency1, DF2x.Currency1) Currency1,
DF2x.Currency2,
coalesce(DF1x.Currency3, DF2x.Currency3) Currency3
from DF2x left join DF1x using (Date)")
names(s) <- names(DF2)
2) dplyr dplyr can do it in essentially the same manner and even has a coalesce
function:
library(dplyr)
DF2 %>%
left_join(DF1, by = "Date") %>%
transmute(Date,
Currency1 = coalesce(Currency1.y, Currency1.x),
Currency2,
Currency3 = coalesce(Currency3.y, Currency3.x))
giving:
Date Currency1 Currency2 Currency3
1 1 1.12493092 -0.15579551 1.1000254
2 2 -0.04493361 -1.47075238 0.7631757
3 3 -0.01619026 -0.47815006 -0.1645236
4 4 -0.83562860 0.41794156 -0.2533617
5 5 1.59528080 1.35867955 0.6969634
6 6 0.32950780 -0.10278773 1.5117812
7 7 -0.82046840 0.38767161 0.3898432
8 8 0.48742910 -0.05380504 -0.6212406
9 9 0.73832470 -1.37705956 -2.2146999
10 10 -1.98935170 -0.41499456 0.7685329
11 11 0.61982575 -0.39428995 -0.1123462
12 12 -0.05612874 -0.05931340 0.8811077
Note: The original code defining the inputs in the question did not use set.seed
making it not reproducible so we used the following which corresponds to the inputs originally shown in the question. (Since then the question has been fixed to add set.seed
.)
DF1 <-
structure(list(Date = 4:9, Currency1 = c(-0.8356286, 1.5952808,
0.3295078, -0.8204684, 0.4874291, 0.7383247), Currency3 = c(NA,
NA, 1.5117812, 0.3898432, -0.6212406, -2.2146999)), .Names = c("Date",
"Currency1", "Currency3"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
DF2 <-
structure(list(Date = 1:12, Currency1 = c(1.12493092, -0.04493361,
-0.01619026, 0.94383621, 0.8212212, 0.59390132, 0.91897737, 0.7821363,
0.07456498, -1.9893517, 0.61982575, -0.05612874), Currency2 = c(-0.15579551,
-1.47075238, -0.47815006, 0.41794156, 1.35867955, -0.10278773,
0.38767161, -0.05380504, -1.37705956, -0.41499456, -0.39428995,
-0.0593134), Currency3 = c(1.1000254, 0.7631757, -0.1645236,
-0.2533617, 0.6969634, 0.5566632, -0.6887557, -0.7074952, 0.364582,
0.7685329, -0.1123462, 0.8811077)), .Names = c("Date", "Currency1",
"Currency2", "Currency3"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
Upvotes: 1
Reputation: 83275
To replace only the values in DF3
that have non-missing counterparts in DF1
you could use ifelse
.
Using:
DF3 <- copy(DF2)
nm1 <- names(DF1)[-1]
nm2 <- names(DF2)
setDT(DF3)[DF1, (nm1) := {n <- seq_along(nm1);
lapply(n, function(i) ifelse(is.na(get(paste0("i.", nm1[i]))) & nm1[i] %in% nm2,
get(paste0("x.", nm1[i])),
get(paste0("i.", nm1[i]))))},
on = .(Date)]
you get:
> DF3
Date Currency1 Currency2 Currency3 Currency6 Currency7 Currency8
1: 1 -0.42675738 -1.260985237 -0.09920208 NA NA NA
2: 2 -0.22361182 1.139464085 -0.23803425 NA NA NA
3: 3 0.71760679 -1.221781923 0.04778266 NA NA NA
4: 4 -1.20706575 1.573315888 0.29651274 -0.77625389 -0.8371717 -0.6937202
5: 5 0.27742924 0.073477874 -0.83380992 0.06445882 2.4158352 -1.4482049
6: 6 1.08444118 -1.175115087 -0.56445200 0.95949406 0.1340882 0.5747557
7: 7 -0.29716745 -1.588261899 -0.89003783 -0.11028549 -0.4906859 -1.0236557
8: 8 0.19600465 -0.747380729 -0.47719270 -0.51100951 -0.4405479 -0.0151383
9: 9 0.50605589 0.483521864 -0.99838644 -0.91119542 NA -0.9359486
10: 10 -0.71869815 -0.003025539 -0.47511202 NA NA NA
11: 11 -0.06723632 -0.008930402 0.85241411 NA NA NA
12: 12 0.34436710 0.593357619 -0.75151885 NA NA NA
Used data:
set.seed(1234)
DF1=data.frame(matrix(data=c(c(4:9),rnorm(30)),6,6))
set.seed(4321)
DF2=data.frame(matrix(data=c(c(1:12),rnorm(36)),12,4))
names(DF1)=c("Date","Currency1","Currency3","Currency6","Currency7","Currency8")
names(DF2)=c("Date","Currency1","Currency2","Currency3")
DF1$Currency3[1:2]=NA
DF1$Currency1[4:5]=NA
DF1$Currency7[nrow(DF1)]=NA
Upvotes: 1