Reputation: 143
I read data from different sources into a data.table. Two sources supply different variables for the same time step.
How can I replace the missing variables by the other source (row)?
Here is a minimal example:
(gg = data.table(SourceCode = c(1,1,2,2), time = c(1,2,1,2), LE = c(10,10,NA,NA), R = c(NA,NA,20,30)))
SourceCode time LE R
1: 1 1 10 NA
2: 1 2 10 NA
3: 2 1 NA 20
4: 2 2 NA 30
> # rename SourceCode
> gg[SourceCode == 1, SourceCode := 2 ]
> gg
SourceCode time LE R
1: 2 1 10 NA
2: 2 2 10 NA
3: 2 1 NA 20
4: 2 2 NA 30
Desired output:
SourceCode time LE R
1: 2 1 10 20
2: 2 2 10 30
Upvotes: 1
Views: 53
Reputation: 8120
An option:
library(tidyverse)
dd %>%
gather(var, val, -SourceCode, -time) %>%
na.omit(val) %>%
spread(var, val)
# SourceCode time LE R
# 1 2 1 10 20
# 2 2 2 10 30
Or another option based on grouping
dd %>%
group_by(SourceCode, time) %>%
summarise_at(vars(LE:R), .funs = funs(.[which(!is.na(.))]))
# SourceCode time LE R
# 1 2 1 10 20
# 2 2 2 10 30
Note that I only add the SourceCode in the group_by
call to keep it in the summary. If you do not need that column, you could omit that.
Upvotes: 0
Reputation: 50728
Since you seem to be working with data.table
s, here is a data.table
solution
unique(gg[, `:=`(LE = LE[!is.na(LE)], R = R[!is.na(R)]), by = time], by = "time")
# SourceCode time LE R
#1: 1 1 10 20
#2: 1 2 10 30
Or for the second source
unique(gg[, `:=`(LE = LE[!is.na(LE)], R = R[!is.na(R)]), by = time], by = "time", fromLast = T)
# SourceCode time LE R
#1: 2 1 10 20
#2: 2 2 10 30
Since SourceCode
seems not relevant anymore (you summarise across different SourceCode
s) you could also do
gg[, lapply(.SD, function(x) x[!is.na(x)]), by = time, .SDcols = 3:4]
# time LE R
#1: 1 10 20
#2: 2 10 30
Upvotes: 0
Reputation: 33603
I had a similar question recently and discovered dplyr::coalesce()
:
The simple solution would be:
library(dplyr)
coalesce(
filter(gg, SourceCode == 2),
filter(gg, SourceCode == 1)
)
SourceCode time LE R
1 2 1 10 20
2 2 2 10 30
But more generalisable:
do.call(coalesce, split(gg, gg$SourceCode))
SourceCode time LE R
1: 1 1 10 20
2: 1 2 10 30
If you want to base of the second source (or the last source) you could do:
do.call(coalesce, rev(split(gg, gg$SourceCode)))
SourceCode time LE R
1: 2 1 10 20
2: 2 2 10 30
Upvotes: 1