Reputation: 1086
This is a sample of my data from CSV. It contains ~10 columns.
Product_id Product_Weight Product_Name Shop_Name ...
[1] A 10 xxxx Walmart
[2] B 12 yyyy Target
[3] C 11 zzzz Target
[4] A NA xxxx Walmart
[5] C NA zzzz Target
I would like to fill the NAs in rows 4 and 5 with 10 and 11 respectively (since the product weights of A and C are already known from rows 1 and 3). I want the final data frame to look like this:
Product_id Product_Weight Product_Name Shop_Name ...
[1] A 10 xxxx Walmart
[2] B 12 yyyy Target
[3] C 11 zzzz Target
[4] A 10 xxxx Walmart
[5] C 11 zzzz Target
What is the best way to do this in R?
Upvotes: 8
Views: 1062
Reputation: 269694
Although the question asked for the "previous occurrence", that approach has a disadvantage: if the first Product_Weight in any Product_id were NA, it could not be filled in even if we knew the Product_Weight from a subsequent row with the same Product_id. So instead of using the previous occurrence, we take the mean of all non-NAs with the same Product_id. Since these should all be the same, their mean is their common value.
If you really do want the previous occurrence, use the Prev function (na.locf is from the zoo package), defined as:
Prev <- function(x) na.locf(x, na.rm = FALSE)
in place of na.aggregate in (1) and (3), and do not use (2).
The following solutions have the advantages that they all:
preserve the order of the input
work even if the first Product_Weight in any Product_id is NA
do not modify the input
The first solution has the additional advantage of being only one line of code (plus a library statement), and the second solution has the additional advantage of not using any packages.
1) zoo::na.aggregate We use na.aggregate from the zoo package (which replaces all NAs with the average of the non-NAs) and apply it to Product_Weight separately for each Product_id.
library(zoo)
transform(DF, Product_Weight = ave(Product_Weight, Product_id, FUN = na.aggregate))
giving:
Product_id Product_Weight Product_Name Shop_Name
1 A 10 xxxx Walmart
2 B 12 yyyy Target
3 C 11 zzzz Target
4 A 10 xxxx Walmart
5 C 11 zzzz Target
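To make the difference between na.aggregate and the Prev variant concrete, here is a minimal sketch on a small hypothetical data frame (DF2 is made up for illustration and is not the question's data) in which the first weight for product A is missing. It assumes the zoo package is installed.

```r
library(zoo)

# Hypothetical input: the first row for product A has no weight yet
DF2 <- data.frame(
  Product_id = c("A", "B", "A"),
  Product_Weight = c(NA, 12, 10)
)

# Previous-occurrence fill; leading NAs within a group are kept
Prev <- function(x) na.locf(x, na.rm = FALSE)

by_prev <- transform(DF2, Product_Weight = ave(Product_Weight, Product_id, FUN = Prev))
by_mean <- transform(DF2, Product_Weight = ave(Product_Weight, Product_id, FUN = na.aggregate))

by_prev$Product_Weight  # the leading NA for A stays NA
by_mean$Product_Weight  # the leading NA for A is filled from the later row
```

With Prev the first row stays NA because there is no earlier value to carry forward, whereas na.aggregate fills it from the later row for the same Product_id.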
2) No packages Alternatively, use Mean in place of na.aggregate, where Mean is defined as:
Mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
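Put together, the package-free variant might look like the sketch below. The data frame is rebuilt inline with data.frame() only so that the snippet runs on its own; the name result is introduced here for illustration.

```r
# Sample data matching the question (built inline so the snippet is self-contained)
DF <- data.frame(
  Product_id = c("A", "B", "C", "A", "C"),
  Product_Weight = c(10, 12, 11, NA, NA),
  Product_Name = c("xxxx", "yyyy", "zzzz", "xxxx", "zzzz"),
  Shop_Name = c("Walmart", "Target", "Target", "Walmart", "Target")
)

# Replace NAs in a vector with the mean of its non-NA values
Mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# Apply Mean to Product_Weight within each Product_id; row order is preserved
result <- transform(DF, Product_Weight = ave(Product_Weight, Product_id, FUN = Mean))
result
```

One caveat: if every Product_Weight for some Product_id is NA, mean(x, na.rm = TRUE) returns NaN, so those rows end up as NaN rather than NA.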
3) dplyr/zoo Add row numbers, group by Product_id, fill in the NAs as in the prior solutions using either na.aggregate (as shown below) or Mean, arrange back to the original order and remove the row numbers:
library(dplyr)
library(zoo)
DF %>%
  mutate(row = row_number()) %>%
  group_by(Product_id) %>%
  mutate(Product_Weight = na.aggregate(Product_Weight)) %>%
  ungroup() %>%
  arrange(row) %>%
  select(-row)
Note: This was used for the input DF:
Lines <- " Product_id Product_Weight Product_Name Shop_Name
A 10 xxxx Walmart
B 12 yyyy Target
C 11 zzzz Target
A NA xxxx Walmart
C NA zzzz Target"
DF <- read.table(text = Lines, header = TRUE)
Upvotes: 9
Reputation: 24079
Here is a solution with base R commands:
# create a lookup table of the known Product_id / Product_Weight combinations
lookup <- unique(df[complete.cases(df[, 1:2]), 1:2])
# find the rows whose weight is NA and needs replacement
idx <- which(is.na(df$Product_Weight))
# match each of those rows to its Product_id in the lookup table and copy the weight
df$Product_Weight[idx] <- lookup$Product_Weight[match(df$Product_id[idx], lookup$Product_id)]
This is most likely not as efficient as the dplyr/tidyr solution in the other answer.
Upvotes: 2
Reputation: 214977
Another option with dplyr and tidyr:
library(dplyr); library(tidyr);
df %>% group_by(Product_id) %>% fill(Product_Weight)
Source: local data frame [5 x 4]
Groups: Product_id [3]
Product_id Product_Weight Product_Name Shop_Name
(fctr) (int) (fctr) (fctr)
1 A 10 xxxx Walmart
2 A 10 xxxx Walmart
3 B 12 yyyy Target
4 C 11 zzzz Target
5 C 11 zzzz Target
Note that the result is sorted by Product_id, though.
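By default fill() only carries values downward, so a leading NA within a group would stay NA. A possible variation, assuming a tidyr version that supports the .direction argument (1.0.0 or later), is to fill in both directions. The data frame df2 below is hypothetical and exists only to show the leading-NA case.

```r
library(dplyr)
library(tidyr)

# Hypothetical input: the first row for product A has no weight
df2 <- data.frame(
  Product_id = c("A", "B", "A"),
  Product_Weight = c(NA, 12, 10)
)

# Fill down first, then up, within each Product_id
out <- df2 %>%
  group_by(Product_id) %>%
  fill(Product_Weight, .direction = "downup") %>%
  ungroup()

out
```

Whether the rows come back grouped by Product_id or in their original order may depend on the installed dplyr and tidyr versions, so it is worth checking the output if row order matters.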
Upvotes: 6