La Chouette
La Chouette

Reputation: 67

Filling NA values in a panel data

I have a panel data containing NA values. I would like to fill in the NAs by the values of an other data. Let say I want to complete the following panel with new.df.

panel <- data.frame("time" = c(rep(2000,5), rep(2001,5)), 
                    "var1" = rep(1:5, times=2), 
                    "var2" = c(NA,'b','c',NA,'d','a1','b1','c1',NA,'d1'))

new.df <- data.frame("time" = c(2000:2001), 
                     "var1" = c(4,4), 
                     "var2" = c('e','e'))

I tried different combination of merge / aggregate / ddplyr etc.. The issue is that merge or merge.data.frame creates additional columns indexed by .x and .y even tho the colnames are identical.

> merge(panel,new.df,by = c("time","var1"), all=T)
   time var1 var2.x var2.y
1  2000    1   <NA>   <NA>
2  2000    2      b   <NA>
3  2000    3      c   <NA>
4  2000    4   <NA>      e
5  2000    5      d   <NA>
6  2001    1     a1   <NA>
7  2001    2     b1   <NA>
8  2001    3     c1   <NA>
9  2001    4   <NA>      e
10 2001    5     d1   <NA>

I tried also to play with the na.action option without success because the panel will still be incomplete after merging and the remaining NAmust stay as they are. (Depending on the formulation, NA treatment will in some cases replace NA by 0, or by NaN)

I would like to find a way to target the correct indexes in the panel to "insert" new.df$var2 at its right place, knowing that I have a very large panel and it will remain incomplete at the end.

Thanks in advance.

Upvotes: 2

Views: 1027

Answers (3)

Paul Rougieux
Paul Rougieux

Reputation: 11429

Recreate data as data frame

library(dplyr)
panel <- data_frame("time" = c(rep(2000,5), rep(2001,5)), 
                    "var1" = rep(1:5, times=2), 
                    "var2" = c(NA,'b','c',NA,'d','a1','b1','c1',NA,'d1'))

new.df <- data_frame("time" = c(2000:2001), 
                     "var1" = c(4,4), 
                     "var2" = c('e','e'))

Solution 1 Fill NA values with base R merge

panelnew <- merge(panel,new.df,by = c("time","var1"), all=T)
panelnew$var2 <- ifelse(is.na(panelnew$var2.x), panelnew$var2.y, panelnew$var2.x)
panelnew[c("time","var1","var2")]
   time var1 var2
1  2000    1 <NA>
2  2000    2    b
3  2000    3    c
4  2000    4    e
5  2000    5    d
6  2001    1   a1
7  2001    2   b1
8  2001    3   c1
9  2001    4    e
10 2001    5   d1

Solution 2 fill NA values with dplyr left_join and mutate

Here I use dplyr left_join to join the new values. Use full_join if you want to also add combinations of time and var1 that were not present in the original panel. You will get columns var2.x and var2.y columns and that's OK because it reflects the fact that there is an old and a new value. Then mutate to replace NA values by the new value.

result <- panel %>% 
    left_join(new.df, by = c("time", "var1")) %>% 
    mutate(var2 = ifelse(is.na(var2.x),var2.y,var2.x))

Then if you want to keep only the columns of interest

result <- result %>% select(time, var1, var2)

What do you plan to do if a new value replaces an existing value? The code above will keep the old value.

Upvotes: 1

akrun
akrun

Reputation: 887951

We can use coalesce from tidyr

library(tidyr)
library(dplyr)
full_join(as.data.frame(panel),as.data.frame(new.df),by = c("time","var1")) %>% 
       mutate_each(funs(as.character), var2.x:var2.y) %>%
       mutate(var2= coalesce(var2.x, var2.y)) %>% 
       select(-var2.x, -var2.y)
#   time var1 var2
#1  2000    1 <NA>
#2  2000    2    b
#3  2000    3    c
#4  2000    4    e
#5  2000    5    d
#6  2001    1   a1
#7  2001    2   b1
#8  2001    3   c1
#9  2001    4    e
#10 2001    5   d1

Or we can use a base R option with max.col. Here, 'd1' is the OP's output with merge

 d1$var2 <-d1[,3:4][cbind(1:nrow(d1), max.col(!is.na(d1[3:4]), "first"))]
 d1$var
 #[1] NA   "b"  "c"  "e"  "d"  "a1" "b1" "c1" "e"  "d1"

Upvotes: 2

count
count

Reputation: 1338

Or simply (assuming that all values in new.df map to a corresponding NA value in your Panel):

ind <- which(paste0(panel[,1],panel[,2]) %in% paste0(new.df[,1],new.df[,2]))
panel[ind,3] = new.df[,3]

Upvotes: 1

Related Questions