Reputation: 67
I have a panel data containing NA values. I would like to fill in the NAs by the values of an other data. Let say I want to complete the following panel
with new.df
.
panel <- data.frame("time" = c(rep(2000,5), rep(2001,5)),
"var1" = rep(1:5, times=2),
"var2" = c(NA,'b','c',NA,'d','a1','b1','c1',NA,'d1'))
new.df <- data.frame("time" = c(2000:2001),
"var1" = c(4,4),
"var2" = c('e','e'))
I tried different combination of merge / aggregate / ddplyr etc.. The issue is that merge
or merge.data.frame
creates additional columns indexed by .x
and .y
even tho the colnames are identical.
> merge(panel,new.df,by = c("time","var1"), all=T)
time var1 var2.x var2.y
1 2000 1 <NA> <NA>
2 2000 2 b <NA>
3 2000 3 c <NA>
4 2000 4 <NA> e
5 2000 5 d <NA>
6 2001 1 a1 <NA>
7 2001 2 b1 <NA>
8 2001 3 c1 <NA>
9 2001 4 <NA> e
10 2001 5 d1 <NA>
I tried also to play with the na.action
option without success because the panel will still be incomplete after merging and the remaining NA
must stay as they are. (Depending on the formulation, NA treatment will in some cases replace NA
by 0
, or by NaN
)
I would like to find a way to target the correct indexes in the panel to "insert" new.df$var2
at its right place, knowing that I have a very large panel and it will remain incomplete at the end.
Thanks in advance.
Upvotes: 2
Views: 1027
Reputation: 11429
library(dplyr)
panel <- data_frame("time" = c(rep(2000,5), rep(2001,5)),
"var1" = rep(1:5, times=2),
"var2" = c(NA,'b','c',NA,'d','a1','b1','c1',NA,'d1'))
new.df <- data_frame("time" = c(2000:2001),
"var1" = c(4,4),
"var2" = c('e','e'))
panelnew <- merge(panel,new.df,by = c("time","var1"), all=T)
panelnew$var2 <- ifelse(is.na(panelnew$var2.x), panelnew$var2.y, panelnew$var2.x)
panelnew[c("time","var1","var2")]
time var1 var2
1 2000 1 <NA>
2 2000 2 b
3 2000 3 c
4 2000 4 e
5 2000 5 d
6 2001 1 a1
7 2001 2 b1
8 2001 3 c1
9 2001 4 e
10 2001 5 d1
Here I use dplyr left_join
to join the new values. Use full_join
if you want to also add combinations of time and var1 that were not present in the original panel. You will get columns var2.x
and var2.y
columns and that's OK because it reflects the fact that there is an old and a new value. Then mutate to replace NA values by the new value.
result <- panel %>%
left_join(new.df, by = c("time", "var1")) %>%
mutate(var2 = ifelse(is.na(var2.x),var2.y,var2.x))
Then if you want to keep only the columns of interest
result <- result %>% select(time, var1, var2)
What do you plan to do if a new value replaces an existing value? The code above will keep the old value.
Upvotes: 1
Reputation: 887951
We can use coalesce
from tidyr
library(tidyr)
library(dplyr)
full_join(as.data.frame(panel),as.data.frame(new.df),by = c("time","var1")) %>%
mutate_each(funs(as.character), var2.x:var2.y) %>%
mutate(var2= coalesce(var2.x, var2.y)) %>%
select(-var2.x, -var2.y)
# time var1 var2
#1 2000 1 <NA>
#2 2000 2 b
#3 2000 3 c
#4 2000 4 e
#5 2000 5 d
#6 2001 1 a1
#7 2001 2 b1
#8 2001 3 c1
#9 2001 4 e
#10 2001 5 d1
Or we can use a base R
option with max.col
. Here, 'd1' is the OP's output with merge
d1$var2 <-d1[,3:4][cbind(1:nrow(d1), max.col(!is.na(d1[3:4]), "first"))]
d1$var
#[1] NA "b" "c" "e" "d" "a1" "b1" "c1" "e" "d1"
Upvotes: 2
Reputation: 1338
Or simply (assuming that all values in new.df map to a corresponding NA value in your Panel):
ind <- which(paste0(panel[,1],panel[,2]) %in% paste0(new.df[,1],new.df[,2]))
panel[ind,3] = new.df[,3]
Upvotes: 1