Reputation: 125
I have two daily time series with 20 stations listed as id and I am trying to merge these two dataframe in R. The dataframes columns headings are as follows.
names(data1)
"id" "year" "yday" "date" "t_1" "t_2" "t_3" "r1" "s1" "p1" "s2"
names(data2)
"id" "year" "yday" "date" "t_1" "t_2" "t_3" "r1" "s1" "p1"
I tried to merge it using the following code
newdata<- merge(data1,data2, all=TRUE)
This is only partial solution. However, for certain id, the dates between the two dataframe overlap and also for same date there are NAs in data1, while data2 do not have missing values. I am trying to merge it so that I can get rid of the duplicate dates for each id, while at the same time retain the data which is available for duplication and retain the columns headings for data1. Eg this is how it merge
id year yday date t_1 t_2 t_3 r1 s1 p1 s2
AA1111 2007 3 03/01/2007 -5.3 -11.6 -8.5 0 0 0 NA
AA1111 2007 3 03/01/2007 -5.3 -11.6 NA NA NA 0 32
I would like
id year yday date t_1 t_2 t_3 r1 s1 p1 s2
AA1111 2007 3 03/01/2007 -5.3 -11.6 -8.5 0 0 0 32
The codes above above didn't give me quite my desired results. Your guidance on how i can get my desired result would be very appreciated ( as i am still trying with R)
Upvotes: 2
Views: 704
Reputation: 3501
Without a representative data, here's an approach based on a sample data I created.
# sample merged data (hope it represents your data fully)
DF = structure(list(id = structure(c(1L, 1L, 2L, 2L, 2L), .Label = c("AA1111",
"AA1112"), class = "factor"), year = c(2007L, 2007L, 2008L, 2008L,
2008L), yday = c(3L, 3L, 3L, 3L, 3L), date = structure(c(1L,
1L, 1L, 1L, 1L), .Label = "03/01/2007", class = "factor"), t_1 = c(-5.3,
-5.3, -10.3, -10.3, NA), t_2 = c(-11.6, -11.6, -11.6, NA, -11.6
), t_3 = c(-8.5, NA, -8.5, -8.5, -8.5), r1 = c(0L, NA, 0L, 0L,
0L), s1 = c(0L, NA, 0L, 0L, 0L), p1 = c(0L, 0L, 0L, NA, 0L),
s2 = c(NA, 32L, NA, 42L, NA)), .Names = c("id", "year", "yday",
"date", "t_1", "t_2", "t_3", "r1", "s1", "p1", "s2"), class = "data.frame", row.names = c(NA,
-5L))
# id year yday date t_1 t_2 t_3 r1 s1 p1 s2
# 1 AA1111 2007 3 03/01/2007 -5.3 -11.6 -8.5 0 0 0 NA
# 2 AA1111 2007 3 03/01/2007 -5.3 -11.6 NA NA NA 0 32
# 3 AA1112 2008 3 03/01/2007 -10.3 -11.6 -8.5 0 0 0 NA
# 4 AA1112 2008 3 03/01/2007 -10.3 NA -8.5 0 0 NA 42
# 5 AA1112 2008 3 03/01/2007 NA -11.6 -8.5 0 0 0 NA
library(data.table)
setDT(DF) # convert to data table
DF_new <- DF[, names(DF)[5:11] := lapply(.SD, max, na.rm=TRUE),
by=list(id,year,yday,date), .SDcols=5:11][,unique(.SD)]
DF_new
# id year yday date t_1 t_2 t_3 r1 s1 p1 s2
# 1: AA1111 2007 3 03/01/2007 -5.3 -11.6 -8.5 0 0 0 32
# 2: AA1112 2008 3 03/01/2007 -10.3 -11.6 -8.5 0 0 0 42
setDF(DF_new) # convert back to data frame
.SD
stands for Subset of Data and contains the data for each group specified in by
. It itself is a data.table. .SDcols
argument tells which columns should .SD
contain. The syntax LHS := RHS
runs the expression in the RHS - here, loop over .SD which contains columns specified in .SDcols
and compute max
, and updates the columns specified in LHS by reference (in-place).
Upvotes: 2