Reputation: 1433
So I have a dataframe which I would like to transform and summarize it. It currently looks like this
SAMPLE DATA
sample_date<-data.frame(stringsAsFactors=FALSE,
Date = c("2019-06-04", "2019-06-05", "2019-06-06",
"2019-06-07", "2019-06-08", "2019-06-09"),
apple_cost = c(6685.5601, 17387.3072, 31587.2694, 7489.1275,
8490.1844, 9372.4676),
apple_count = c(601762, 557952, 1003681, 243348, 273511, 303130),
banana_cost = c(6685, 6685, 6685, 6685, 6685, 6685),
banana_count = c(557952, 557952, 557952, 557952, 557952, 557952),
orange_cost = c(6685, 6685, 6685, 6685, 6685, 6685),
orange_count = c(1003681, 1003681, 1003681, 1003681, 1003681, 1003681)
)
Date apple_cost apple_count banana_cost banana_count orange_cost orange_count
1 2019-06-04 6685.560 601762 6685 557952 6685 1003681
2 2019-06-05 17387.307 557952 6685 557952 6685 1003681
3 2019-06-06 31587.269 1003681 6685 557952 6685 1003681
4 2019-06-07 7489.127 243348 6685 557952 6685 1003681
5 2019-06-08 8490.184 273511 6685 557952 6685 1003681
6 2019-06-09 9372.468 303130 6685 557952 6685 1003681
I would like to transform and summarize it to :
EXPECTED OUTPUT
Type = c("apple","banana","orange"),
cost = c(243348,343348,443348),
count = c(3003681,4003681,5003681))
Type cost count
1 apple 243348 3003681
2 banana 343348 4003681
3 orange 443348 5003681
I did try to summarize it with the following code but it doesnt seem like the above expected output. This is what I have tried so far
WHAT I TRIED SO FAR
current_table <- sample_date %>%
summarise( apple_cost = sum( apple_cost, na.rm=TRUE),
apple_count = sum( apple_count, na.rm=TRUE),
banana_cost = sum( banana_cost, na.rm=TRUE),
banana_count = sum(banana_count, na.rm=TRUE),
orange_cost = sum(orange_cost, na.rm=TRUE),
orange_count = sum(orange_count, na.rm=TRUE))
apple_cost apple_count banana_cost banana_count orange_cost orange_count
1 81011.92 2983384 40110 3347712 40110 6022086
Upvotes: 2
Views: 85
Reputation: 887691
An option would be
library(dplyr)# dply_0.8.3
library(tidyr) #tidyr_0.8.3.9000
sample_date %>%
pivot_longer(-Date, names_to = c(".value", "fruits"), names_sep='_') %>%
select(-Date) %>%
group_by(fruits) %>%
summarise_all(sum)
Upvotes: 2
Reputation: 4824
The approach you are taking is verbose and requires lots of column-specific and column-name-specific code.
The idea of dplyr
and of the tidyverse is to use tidy data principles, usually involving reshaping the dataframes into longer formats with one record per row.
You can do that with code like this:
sample_date %>%
gather(column, value, -Date) %>%
separate(column, into=c('fruit', 'parameter'), sep='_') %>%
spread(parameter, value) %>%
group_by(fruit) %>%
summarize(total_cost = sum(cost),
total_count = sum(count))
Here, gather()
reshapes the data into a "long" format, so that there is a unique row for each date-fruit-parameter combination. ("Parameter" here is either "cost" or "count".)
Based on your expected output, though, perhaps a more natural view of the data is to have each unique date-fruit combination as a separate row, but with separate columns for cost
and count
. That is easily done by the separate()
and spread()
functions. separate()
(and gather()
and spread()
) is from tidyr
, not dplyr
, but those packages are often used together. First, separate()
transforms the column
column into two columns, fruit
and parameter
. Then, spread()
decomposes or "widens" the value
column into two columns, using the parameter
column as a key. The result at that point looks like this:
Date fruit cost count
2019-06-04 apple 6685.560 601762
2019-06-04 banana 6685.000 557952
2019-06-04 orange 6685.000 1003681
2019-06-05 apple 17387.307 557952
2019-06-05 banana 6685.000 557952
2019-06-05 orange 6685.000 1003681
2019-06-06 apple 31587.269 1003681
2019-06-06 banana 6685.000 557952
2019-06-06 orange 6685.000 1003681
2019-06-07 apple 7489.127 243348
2019-06-07 banana 6685.000 557952
2019-06-07 orange 6685.000 1003681
2019-06-08 apple 8490.184 273511
2019-06-08 banana 6685.000 557952
2019-06-08 orange 6685.000 1003681
2019-06-09 apple 9372.468 303130
2019-06-09 banana 6685.000 557952
2019-06-09 orange 6685.000 1003681
Your desired output seems like you want to sum over all dates, but to handle each fruit type separately. That's why we group_by(fruit)
before doing summarize()
.
The final output of the whole code block looks like
fruit total_cost total_count
apple 81011.92 2983384
banana 40110.00 3347712
orange 40110.00 6022086
This isn't exactly the format you want, but it's a more natural one for the tidyverse.
The values you expect don't seem to emerge from the sample data you provided. The code you did try gives the "right" numbers, at least as far as I can tell. For example, 81011.92
is the sum of all the apple costs, not 243348
as you say you expect.
Upvotes: 1