nee
nee

Reputation: 125

merging two daily time series with overlapping dates

I have two daily time series with 20 stations listed as id and I am trying to merge these two dataframe in R. The dataframes columns headings are as follows.

names(data1)
"id"     "year"   "yday"   "date"   "t_1"  "t_2"  "t_3" "r1"   "s1"   "p1" "s2"
names(data2)
"id"     "year"   "yday"   "date"   "t_1"  "t_2"  "t_3" "r1"   "s1"   "p1"

I tried to merge it using the following code

newdata<- merge(data1,data2, all=TRUE)

This is only partial solution. However, for certain id, the dates between the two dataframe overlap and also for same date there are NAs in data1, while data2 do not have missing values. I am trying to merge it so that I can get rid of the duplicate dates for each id, while at the same time retain the data which is available for duplication and retain the columns headings for data1. Eg this is how it merge

id     year     yday    date     t_1    t_2  t_3     r1 s1  p1  s2

AA1111  2007    3   03/01/2007  -5.3    -11.6   -8.5    0   0   0   NA          
AA1111  2007    3   03/01/2007  -5.3    -11.6   NA  NA  NA  0   32

I would like

id     year     yday    date     t_1    t_2     t_3   r1    s1  p1  s2

AA1111  2007    3   03/01/2007  -5.3    -11.6   -8.5    0   0   0   32

The codes above above didn't give me quite my desired results. Your guidance on how i can get my desired result would be very appreciated ( as i am still trying with R)

Upvotes: 2

Views: 704

Answers (1)

KFB
KFB

Reputation: 3501

Without a representative data, here's an approach based on a sample data I created.

# sample merged data (hope it represents your data fully)
DF = structure(list(id = structure(c(1L, 1L, 2L, 2L, 2L), .Label = c("AA1111", 
"AA1112"), class = "factor"), year = c(2007L, 2007L, 2008L, 2008L, 
2008L), yday = c(3L, 3L, 3L, 3L, 3L), date = structure(c(1L, 
1L, 1L, 1L, 1L), .Label = "03/01/2007", class = "factor"), t_1 = c(-5.3, 
-5.3, -10.3, -10.3, NA), t_2 = c(-11.6, -11.6, -11.6, NA, -11.6
), t_3 = c(-8.5, NA, -8.5, -8.5, -8.5), r1 = c(0L, NA, 0L, 0L, 
0L), s1 = c(0L, NA, 0L, 0L, 0L), p1 = c(0L, 0L, 0L, NA, 0L), 
    s2 = c(NA, 32L, NA, 42L, NA)), .Names = c("id", "year", "yday", 
"date", "t_1", "t_2", "t_3", "r1", "s1", "p1", "s2"), class = "data.frame", row.names = c(NA, 
-5L))

#       id year yday       date   t_1   t_2  t_3 r1 s1 p1 s2
# 1 AA1111 2007    3 03/01/2007  -5.3 -11.6 -8.5  0  0  0 NA
# 2 AA1111 2007    3 03/01/2007  -5.3 -11.6   NA NA NA  0 32
# 3 AA1112 2008    3 03/01/2007 -10.3 -11.6 -8.5  0  0  0 NA
# 4 AA1112 2008    3 03/01/2007 -10.3    NA -8.5  0  0 NA 42
# 5 AA1112 2008    3 03/01/2007    NA -11.6 -8.5  0  0  0 NA

library(data.table)
setDT(DF) # convert to data table
DF_new <- DF[, names(DF)[5:11] := lapply(.SD, max, na.rm=TRUE),
             by=list(id,year,yday,date), .SDcols=5:11][,unique(.SD)]
DF_new
#        id year yday       date   t_1   t_2  t_3 r1 s1 p1 s2
# 1: AA1111 2007    3 03/01/2007  -5.3 -11.6 -8.5  0  0  0 32
# 2: AA1112 2008    3 03/01/2007 -10.3 -11.6 -8.5  0  0  0 42

setDF(DF_new) # convert back to data frame

.SD stands for Subset of Data and contains the data for each group specified in by. It itself is a data.table. .SDcols argument tells which columns should .SD contain. The syntax LHS := RHS runs the expression in the RHS - here, loop over .SD which contains columns specified in .SDcols and compute max, and updates the columns specified in LHS by reference (in-place).

Upvotes: 2

Related Questions