Reputation: 301
I have a dataframe with duplicate store/product combinations. I want to remove the duplicate values, but I want to keep the costs for these products for each year.
example dataframe:
store product year1 year2 year3
H&M shirt 20.95 29.95 NA
Mango trousers 49.95 NA NA
H&M trousers 39.95 NA 39.95
Mango trousers NA NA 44.95
How I want the dataset to look:
store product year1 year2 year3
H&M shirt 20.95 29.95 NA
H&M trousers 39.95 NA 39.95
Mango trousers 49.95 NA 44.95
I've used dplyr but this only seemed to remove the duplicates, instead of keeping all the costs values. Any help is appreciated!
reproducible code:
df <- data.frame(store= c("H&M", "Mango", "H&M", "Mango"), product=c("shirt", "trousers", "trousers", "trousers"),
year1=c(20.95, 49.95, 39.95, NA), year2=c(29.95, NA, NA, NA), year3=c(NA,NA,39.95, 44.95))
Upvotes: 0
Views: 503
Reputation: 819
Indeed, dplyr is the way to go, together with tidyr, which provides gather() and spread().
First you gather() the data, then you group_by() and summarize(), and finally spread() it back, filling with NAs where missing, i.e.:
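library(dplyr)
library(tidyr)  # gather() and spread() come from tidyr, not dplyr
df <- data.frame(store = c("H&M", "Mango", "H&M", "Mango"),
                 product = c("shirt", "trousers", "trousers", "trousers"),
                 year1 = c(20.95, 49.95, 39.95, NA),
                 year2 = c(29.95, NA, NA, NA),
                 year3 = c(NA, NA, 39.95, 44.95))
new.df <- df %>%
  # wide -> long: one row per store/product/year
  gather(year, value, -store, -product) %>%
  group_by(year, store, product) %>%
  # sum the non-missing values; keep NA when a group has no value at all
  summarize(sum.value = if (all(is.na(value))) NA_real_ else sum(value, na.rm = TRUE)) %>%
  # long -> wide: one column per year again
  spread(key = year, value = sum.value, fill = NA)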
Using -store and -product tells gather() to leave these two columns alone, gather the year columns into a new "year" column, and call the new number column "value" (you can replace this with whatever name you like).
Then group_by() and summarize() make sure we don't end up with duplicates (using the sum of the values when several rows relate to the same store and product).
Finally, spread() gives the form you are looking for.
You have to be careful with how you treat duplicates and what you assume about them. This answer assumes that if the same store and product appear in more than one row, the value you want is the sum of year1, the sum of year2 and the sum of year3 over those rows. If NAs are present in one of the group_by() groups, a plain sum(value) returns NA for that group unless you add na.rm = TRUE to the sum command, i.e.: summarize(sum.value = sum(value, na.rm = TRUE)). On its own, though, na.rm = TRUE turns a group that contains only NAs into 0 instead of NA.
For the example you supplied, the tibble you wanted comes from dropping the NAs inside a group while keeping NA for groups that have no value at all.
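The same reshape, aggregate, reshape idea can also be sketched with pivot_longer() and pivot_wider(), which supersede gather() and spread() in tidyr 1.0.0 and later. This is only a sketch under a few assumptions: dplyr 1.0.0 or later is installed (for the .groups argument), duplicates should be summed, and sum_or_na is a hypothetical helper name, not a function from either package.

```r
library(dplyr)
library(tidyr)

df <- data.frame(store = c("H&M", "Mango", "H&M", "Mango"),
                 product = c("shirt", "trousers", "trousers", "trousers"),
                 year1 = c(20.95, 49.95, 39.95, NA),
                 year2 = c(29.95, NA, NA, NA),
                 year3 = c(NA, NA, 39.95, 44.95))

# Hypothetical helper: sum the non-missing values,
# but return NA when every value in the group is missing.
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

new.df <- df %>%
  # wide -> long: one row per store/product/year
  pivot_longer(cols = starts_with("year"),
               names_to = "year", values_to = "value") %>%
  group_by(store, product, year) %>%
  summarise(sum.value = sum_or_na(value), .groups = "drop") %>%
  # long -> wide: one column per year again
  pivot_wider(names_from = year, values_from = sum.value)

print(new.df)
```

For the example data, the Mango trousers row keeps 49.95 in year1 and 44.95 in year3, while year2 stays NA because neither duplicate row has a value there.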
Upvotes: 1
Reputation: 3994
You can use the package dplyr.
dfn <- df %>%
  group_by(store, product) %>%
  summarise(year1 = sum(year1, na.rm = TRUE),
            year2 = sum(year2, na.rm = TRUE),
            year3 = sum(year3, na.rm = TRUE))
When you print out dfn, you get
store product year1 year2 year3
<fctr> <fctr> <dbl> <dbl> <dbl>
1 H&M shirt 20.95 29.95 0.00
2 H&M trousers 39.95 0.00 39.95
3 Mango trousers 49.95 0.00 44.95
You wanted to group by two variables, so the group_by function is best suited for it. I know that you want NAs where the 0s are, and you can replace them in a subsequent line (note that this would also turn a genuine cost of 0 into NA):
dfn[dfn == 0] <- NA
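If a real cost of 0 could occur in the data, replacing zeros after the fact would blank it out as well. A minimal sketch that skips the replacement step is shown here, assuming dplyr 1.0.0 or later (for across() and the .groups argument). The name sum_or_na is a hypothetical helper introduced for illustration, not part of dplyr.

```r
library(dplyr)

df <- data.frame(store = c("H&M", "Mango", "H&M", "Mango"),
                 product = c("shirt", "trousers", "trousers", "trousers"),
                 year1 = c(20.95, 49.95, 39.95, NA),
                 year2 = c(29.95, NA, NA, NA),
                 year3 = c(NA, NA, 39.95, 44.95))

# Hypothetical helper: sum the non-missing values,
# but return NA when every value in the group is missing.
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

dfn <- df %>%
  group_by(store, product) %>%
  # apply the helper to every column whose name starts with "year"
  summarise(across(starts_with("year"), sum_or_na), .groups = "drop")

print(dfn)
```

Using across() also means the year columns do not have to be listed one by one, so the same call should keep working if more year columns are added, as long as their names start with "year".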
Upvotes: 1