Veraaa

Reputation: 301

Remove duplicates but keep values in R

I have a dataframe with duplicate store/product combinations. I want to remove the duplicate values, but I want to keep the costs for these products for each year.

example dataframe:

store    product    year1  year2  year3 
H&M      shirt      20.95  29.95  NA
Mango    trousers   49.95  NA     NA
H&M      trousers   39.95  NA     39.95
Mango    trousers   NA     NA     44.95

How I want the dataset to look:

store    product    year1  year2  year3 
H&M      shirt      20.95  29.95  NA
H&M      trousers   39.95  NA     39.95
Mango    trousers   49.95  NA     44.95

I've used dplyr, but this only seemed to remove the duplicates instead of keeping all the cost values. Any help is appreciated!

reproducible code:

df <- data.frame(store= c("H&M", "Mango", "H&M", "Mango"), product=c("shirt", "trousers", "trousers", "trousers"), 
                 year1=c(20.95, 49.95, 39.95, NA), year2=c(29.95, NA, NA, NA), year3=c(NA,NA,39.95, 44.95))

Upvotes: 0

Views: 503

Answers (2)

Adi Sarid

Reputation: 819

dplyr together with tidyr is the way to go. First gather() the data into long format, then group_by() and summarize(), and finally spread() it back into wide format, filling with NA where values are missing:

library(dplyr)
library(tidyr)  # gather() and spread() come from tidyr, not dplyr
df <- data.frame(store= c("H&M", "Mango", "H&M", "Mango"), 
                 product=c("shirt", "trousers", "trousers", "trousers"), 
                 year1=c(20.95, 49.95, 39.95, NA), 
                 year2=c(29.95, NA, NA, NA), 
                 year3=c(NA,NA,39.95, 44.95))
new.df <- df %>%
  gather(year, value, -store, -product) %>%
  group_by(year, store, product) %>%
  # sum the non-missing values; keep NA when every value in the group is missing
  summarize(sum.value = if (all(is.na(value))) NA_real_ else sum(value, na.rm = TRUE)) %>%
  ungroup() %>%
  spread(key = year, value = sum.value, fill = NA)

Using -store and -product tells gather() to leave these two columns alone and stack the remaining year columns into key/value pairs: the column names go into "year" and the numbers into a new column called "value" (you can use whatever names you like).

Then group_by() and summarize() collapse the duplicates: each store/product/year combination ends up with a single value, the sum of its non-missing entries.

Finally, spread() gives the form you are looking for.

You have to be careful with how you treat duplicates and what you assume about them. This answer assumes that if the same store and product appear in more than one row, the value you want is the sum of year1, the sum of year2 and the sum of year3 across those rows. NAs need care as well: a plain sum(value) returns NA as soon as one value in the group is NA, which would wipe out Mango's costs here, while sum(value, na.rm = TRUE) returns 0 for a group that contains only NAs. That is why summarize() above checks for all-NA groups explicitly.

With that check in place, the code works for the example you supplied and yields the tibble you wanted.
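
For readers on a recent tidyr, the same long/summarise/wide idea can probably be written with pivot_longer() and pivot_wider(), which tidyr documents as the successors of gather() and spread(). This is only a sketch, assuming tidyr >= 1.0.0 and the example data from the question; the all-NA check is my own choice for keeping NA where no cost exists, not something the question prescribes.

```r
library(dplyr)
library(tidyr)  # pivot_longer() / pivot_wider() need tidyr >= 1.0.0

df <- data.frame(store = c("H&M", "Mango", "H&M", "Mango"),
                 product = c("shirt", "trousers", "trousers", "trousers"),
                 year1 = c(20.95, 49.95, 39.95, NA),
                 year2 = c(29.95, NA, NA, NA),
                 year3 = c(NA, NA, 39.95, 44.95))

new.df <- df %>%
  # long format: one row per store/product/year
  pivot_longer(cols = starts_with("year"),
               names_to = "year", values_to = "value") %>%
  group_by(store, product, year) %>%
  # sum what is there; stay NA if the whole group is missing
  summarise(value = if (all(is.na(value))) NA_real_ else sum(value, na.rm = TRUE)) %>%
  ungroup() %>%
  # back to wide format: one column per year
  pivot_wider(names_from = year, values_from = value)

print(new.df)
```

The result should have one row per store/product combination, with NA only where no row of the input had a cost for that year.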

Upvotes: 1

akash87

Reputation: 3994

You can use the package dplyr.

library(dplyr)

dfn <- df %>%
  group_by(store, product) %>%
  summarise(year1 = sum(year1, na.rm = TRUE),
            year2 = sum(year2, na.rm = TRUE),
            year3 = sum(year3, na.rm = TRUE))

When you print out dfn, you get

   store  product year1 year2 year3
  <fctr>   <fctr> <dbl> <dbl> <dbl>
1    H&M    shirt 20.95 29.95  0.00
2    H&M trousers 39.95  0.00 39.95
3  Mango trousers 49.95  0.00 44.95

You want to group by two variables, so the group_by() function is well suited to this. I know that you want NAs where the 0s are; you can replace them in a subsequent line:

dfn[dfn == 0] <- NA

Note that this also turns a genuine total of 0 into NA.
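
If a real total of 0 must stay 0, one option is to never create the zeros in the first place. The sketch below assumes dplyr >= 1.0.0 (for across() and the .groups argument) and uses a small helper, sum_or_na(), which is a hypothetical name introduced here for illustration and not part of dplyr.

```r
library(dplyr)  # across() and .groups need dplyr >= 1.0.0

df <- data.frame(store = c("H&M", "Mango", "H&M", "Mango"),
                 product = c("shirt", "trousers", "trousers", "trousers"),
                 year1 = c(20.95, 49.95, 39.95, NA),
                 year2 = c(29.95, NA, NA, NA),
                 year3 = c(NA, NA, 39.95, 44.95))

# hypothetical helper: sum of the non-missing values,
# or NA when every value is missing
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

dfn <- df %>%
  group_by(store, product) %>%
  # apply the helper to every column whose name starts with "year"
  summarise(across(starts_with("year"), sum_or_na), .groups = "drop")

print(dfn)
```

Because the helper is applied through across(), additional year columns would be picked up without listing each one by name.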

Upvotes: 1
