Reputation: 331
I haven't found a solution on the web because it is not easy to come up with the right question of the problem. I have two data.frames, x and y, and want to combine them to get z:
The tricky thing is that z compares the date value of x and y, and takes the most recent observations to update A, B, C and D. Hence "dynamically" update/combine.
x=data.frame(c("2000-01-01","2000-06-01","2001-01-01"),c("100","100","100"),c("200","200","200"))
colnames(x)=c("Date","A","B")
y=data.frame(c("2000-01-05","2000-04-09"),c("10","0"),c("0","35"))
colnames(y)=c("Date","C","D")
z=data.frame(c("2000-01-01","2000-01-05","2000-04-09","2000-06-01","2001-01-01"),c("100","100","100","100","100"),c("200","200","200","200","200"),c("0","10","10","0","0"),c("0","0","35","0","0"))
colnames(z)=c("Date","A","B","C","D")
x$Date = as.Date(x$Date)
y$Date = as.Date(y$Date)
Question: How to get to z
by an efficient code?
To illustrate:
> x
Date A B
1 2000-01-01 100 200
2 2000-06-01 100 200
3 2001-01-01 100 200
> y
Date C D
1 2000-01-05 10 0
2 2000-04-09 0 35
> z
Date A B C D
1 2000-01-01 100 200 0 0
2 2000-01-05 100 200 10 0
3 2000-04-09 100 200 10 35
4 2000-06-01 100 200 10 35
5 2001-01-01 100 200 10 35
>
EDIT: Thanks for the answers below. The solutions seems to be a simple full join followed by a loop in a loop (I figured out the second step):
x$Date = as.Date(x$Date)
y$Date = as.Date(y$Date)
tt=merge(x,y,by='Date',all=TRUE)
for (i in 2:(ncol(x)+ncol(y)-1)){
for (j in 2:(nrow(x)+nrow(y))){
if (is.na(tt[j,i])==TRUE & is.na(tt[j-1,i])==FALSE){
tt[j,i]=tt[j-1,i]}
}
}
EDIT2: The solutions posted by others further below seem to be more efficient. Just for completeness, my longer solutions works if the 0 in y are replaced by NA, i.e. defining y as :
y=data.frame(c("2000-01-05","2000-04-09"),c("10",NA),c(NA,"35"))
colnames(y)=c("Date","C","D")
and then replacing the NAs in z in a final step.
I learnt from my first EDIT and I am not editing the original problem above to avoid confusion.
Thanks a lot for your help!
Upvotes: 0
Views: 364
Reputation: 83225
A possible solution could be using a combination of the data.table
and the na.locf
function from the zoo
packackages:
# loading the needed packages
library(data.table)
library(zoo)
# converting x & y to datatables
setDT(x)
setDT(y)
# merge x & y into z
z <- merge(x, y, by="Date", all=TRUE) # this works in base R as well
# fill the NA's with the last observation
cols <- c("A","B","C","D") # in this specific case, you can also use: LETTERS[1:4]
z[, (cols) := lapply(.SD, na.locf, rule = 1, na.rm=FALSE), .SDcols= cols]
this gives:
> z
Date A B C D
1: 2000-01-01 100 200 NA NA
2: 2000-01-05 100 200 10 0
3: 2000-04-09 100 200 0 35
4: 2000-06-01 100 200 0 35
5: 2001-01-01 100 200 0 35
This result can also be achieved in base R as mentioned by @Tensibai in the comments (which for some reason didn't work on my system at first):
z <- merge(x, y, by="Date", all=TRUE)
z <- na.locf(z)
To get the exact desired output, you will need some additional steps (omitting the first steps as they are the same):
# merge x & y into z
z <- merge(x, y, by="Date", all=TRUE) # this works in base R as well
# replace the zero with NA
z[z==0] <- NA
# fill the NA's with the last observation
cols <- LETTERS[1:4]
z[, (cols) := lapply(.SD, na.locf, rule = 1, na.rm=FALSE), .SDcols= cols]
# replace the remaining NA's with zero's
z[is.na(z)] <- 0
this gives:
> z
Date A B C D
1: 2000-01-01 100 200 0 0
2: 2000-01-05 100 200 10 0
3: 2000-04-09 100 200 10 35
4: 2000-06-01 100 200 10 35
5: 2001-01-01 100 200 10 35
In base R you would do:
z <- merge(x, y, by="Date", all=TRUE)
z[z==0] <- NA
z <- na.locf(z)
z[is.na(z)] <- 0
to get the same result.
Upvotes: 3
Reputation: 16121
An alternative approach using dplyr and some functions:
library(lubridate)
library(dplyr)
# dataset
x=data.frame(c("2000-01-01","2000-06-01","2001-01-01"),
c("100","100","100"),
c("200","200","200"), stringsAsFactors = F)
colnames(x)=c("Date","A","B")
y=data.frame(c("2000-01-05","2000-04-09"),
c("10","0"),
c("0","35"), stringsAsFactors = F)
colnames(y)=c("Date","C","D")
# update date columns
x$Date = ymd(x$Date)
y$Date = ymd(y$Date)
# function that replaces NAs with 0s
ff = function(x){x[is.na(x)]=0
return(as.numeric(x))}
# function that updates zero elements with the previous ones
ff2 = function(x){
for (i in 2:length(x)){x[i] = ifelse(x[i]==0, x[i-1], x[i])}
return(x)
}
# create the full dataset
xy =
x %>%
full_join(y, by="Date") %>%
arrange(Date)
xy
# Date A B C D
# 1 2000-01-01 100 200 <NA> <NA>
# 2 2000-01-05 <NA> <NA> 10 0
# 3 2000-04-09 <NA> <NA> 0 35
# 4 2000-06-01 100 200 <NA> <NA>
# 5 2001-01-01 100 200 <NA> <NA>
xy %>%
group_by(Date) %>%
mutate_each(funs(ff)) %>%
ungroup %>%
select(-Date) %>%
mutate_each(funs(ff2)) %>%
bind_cols(data.frame(Date=xy$Date)) %>%
select(Date,A,B,C,D)
# Date A B C D
# 1 2000-01-01 100 200 0 0
# 2 2000-01-05 100 200 10 0
# 3 2000-04-09 100 200 10 35
# 4 2000-06-01 100 200 10 35
# 5 2001-01-01 100 200 10 35
Upvotes: 0