Reputation: 1086

Replace NA with previous occurrence

This is a sample of my data from CSV. It contains ~10 columns.

    Product_id    Product_Weight    Product_Name    Shop_Name ...
[1]    A             10                xxxx            Walmart
[2]    B             12                yyyy            Target
[3]    C             11                zzzz            Target
[4]    A             NA                xxxx            Walmart
[5]    C             NA                zzzz            Target

I would like to fill NA's in row 4 and 5 with 10 and 11 respectively (since the product weight of A and C are already known from row 1 and 3). I want the final data frame to be like this

    Product_id    Product_Weight    Product_Name    Shop_Name ...
[1]    A             10                xxxx            Walmart
[2]    B             12                yyyy            Target
[3]    C             11                zzzz            Target
[4]    A             10                xxxx            Walmart
[5]    C             11                zzzz            Target

What is the best way to do this in R?

Upvotes: 8

Answers (3)

G. Grothendieck

Reputation: 269694

Although the question asked for the "previous occurrence" this would have the disadvantage that if the first Product_Weight in any Product_id were NA then it could not be filled in even if we knew the Product_Weight from a subsequent Product_id so instead of using the previous occurrence we take the mean of all non-NAs with the same Product_id. Since these should all be the same their mean is their common value.

If you really do want the previous occurrence use the Prev function where:

Prev <- function(x) na.locf(x, na.rm = FALSE)

in place of na.aggregate in (1) and (3) and do not use (2).

The following solutions have the advantages that they all:

preserve the order of the input
work even if the first Product_Weight in any Product_id is NA
do not modify the input

The first solution has the additional advantage of being only one line of code (plus a library statement) and the second solution has the additional advantage of not using any packages.

1) zoo::na.aggregate We use na.aggregate in the zoo package (which replaces all NAs with the average of the non-NAs) and we apply it to Product_Weight separately for each Product_id.

library(zoo)
transform(DF, Product_Weight = ave(Product_Weight, Product_id, FUN = na.aggregate))

giving:

  Product_id Product_Weight Product_Name Shop_Name
1          A             10         xxxx   Walmart
2          B             12         yyyy    Target
3          C             11         zzzz    Target
4          A             10         xxxx   Walmart
5          C             11         zzzz    Target

2) No packages Alternately use Mean in place of na.aggregate where Mean is defined as:

Mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

3) dplyr/zoo Add row numbers, group by the Product_id, fill in the NAs as in the prior solutions using either na.aggregate as shown below or Mean, arrange back to the original order and remove the row numbers:

library(dplyr)
library(zoo)

DF %>% 
   mutate(row = row_number()) %>% 
   group_by(Product_id) %>% 
   mutate(Product_Weight = na.aggregate(Product_Weight)) %>% 
   ungroup() %>% 
   arrange(row) %>% 
   select(-row)

Note: This was used for the input DF :

Lines <- "    Product_id    Product_Weight    Product_Name    Shop_Name
    A             10                xxxx            Walmart
    B             12                yyyy            Target
    C             11                zzzz            Target
    A             NA                xxxx            Walmart
    C             NA                zzzz            Target"
DF <- read.table(text = Lines, header = TRUE)

Upvotes: 9

Dave2e

Reputation: 24079

Here is solution with base R commands:

#   create lookup table with item and weight combinations
lookup<-unique(df[complete.cases(df[ ,1:2]),])

#     find the NAs needing replacement: which(is.na(df$weight))
#     find index in lookup tabe:match(df$a[which(is.na(df$weight))
#     subset: df$weight[which(is.na(df$weight))
df$weight[which(is.na(df$weight))]<-lookup$weight[match(df$Product_id[which(is.na(df$weight))], lookup$Product_id)]

Most likely not as efficient as the dplyr/tidyr solution mentioned above.

Upvotes: 2

akuiper

Reputation: 214977

Another option with dplyr and tidyr:

library(dplyr); library(tidyr);
df %>% group_by(Product_id) %>% fill(Product_Weight)

Source: local data frame [5 x 4]
Groups: Product_id [3]

  Product_id Product_Weight Product_Name Shop_Name
      (fctr)          (int)       (fctr)    (fctr)
1          A             10         xxxx   Walmart
2          A             10         xxxx   Walmart
3          B             12         yyyy    Target
4          C             11         zzzz    Target
5          C             11         zzzz    Target

The result is sorted by Product_id though.

Upvotes: 6

Replace NA with previous occurrence

Answers (3)

Related Questions