SNT
SNT

Reputation: 1433

Transform a dataframe and summarize it in R

So I have a dataframe which I would like to transform and summarize it. It currently looks like this

SAMPLE DATA

sample_date<-data.frame(stringsAsFactors=FALSE,
           Date = c("2019-06-04", "2019-06-05", "2019-06-06",
                    "2019-06-07", "2019-06-08", "2019-06-09"),
           apple_cost = c(6685.5601, 17387.3072, 31587.2694, 7489.1275,
                         8490.1844, 9372.4676),
           apple_count = c(601762, 557952, 1003681, 243348, 273511, 303130),
           banana_cost = c(6685, 6685, 6685, 6685, 6685, 6685),
           banana_count = c(557952, 557952, 557952, 557952, 557952, 557952),
           orange_cost = c(6685, 6685, 6685, 6685, 6685, 6685),
           orange_count = c(1003681, 1003681, 1003681, 1003681, 1003681, 1003681)
)
        Date apple_cost apple_count banana_cost banana_count orange_cost orange_count
1 2019-06-04   6685.560      601762        6685       557952        6685      1003681
2 2019-06-05  17387.307      557952        6685       557952        6685      1003681
3 2019-06-06  31587.269     1003681        6685       557952        6685      1003681
4 2019-06-07   7489.127      243348        6685       557952        6685      1003681
5 2019-06-08   8490.184      273511        6685       557952        6685      1003681
6 2019-06-09   9372.468      303130        6685       557952        6685      1003681

I would like to transform and summarize it to :

EXPECTED OUTPUT

                         Type = c("apple","banana","orange"),
                         cost = c(243348,343348,443348),
                         count = c(3003681,4003681,5003681))
Type   cost   count
1  apple 243348 3003681
2 banana 343348 4003681
3 orange 443348 5003681

I did try to summarize it with the following code but it doesnt seem like the above expected output. This is what I have tried so far

WHAT I TRIED SO FAR

current_table <- sample_date %>% 
  summarise( apple_cost = sum( apple_cost, na.rm=TRUE),
             apple_count = sum( apple_count, na.rm=TRUE),
             banana_cost = sum( banana_cost, na.rm=TRUE),
             banana_count = sum(banana_count, na.rm=TRUE),
             orange_cost = sum(orange_cost, na.rm=TRUE),
             orange_count = sum(orange_count, na.rm=TRUE))
apple_cost apple_count banana_cost banana_count orange_cost orange_count
1   81011.92     2983384       40110      3347712       40110      6022086

Upvotes: 2

Views: 85

Answers (2)

akrun
akrun

Reputation: 887691

An option would be

library(dplyr)# dply_0.8.3
library(tidyr) #tidyr_0.8.3.9000 
sample_date %>% 
  pivot_longer(-Date, names_to = c(".value", "fruits"), names_sep='_') %>% 
  select(-Date) %>%
  group_by(fruits) %>%
  summarise_all(sum) 

Upvotes: 2

Curt F.
Curt F.

Reputation: 4824

A tidy approach

The approach you are taking is verbose and requires lots of column-specific and column-name-specific code.

The idea of dplyr and of the tidyverse is to use tidy data principles, usually involving reshaping the dataframes into longer formats with one record per row.

You can do that with code like this:

sample_date %>% 
    gather(column, value, -Date) %>%
    separate(column, into=c('fruit', 'parameter'), sep='_') %>%
    spread(parameter, value) %>% 
    group_by(fruit) %>%
    summarize(total_cost = sum(cost),
              total_count = sum(count))

Here, gather() reshapes the data into a "long" format, so that there is a unique row for each date-fruit-parameter combination. ("Parameter" here is either "cost" or "count".)

Based on your expected output, though, perhaps a more natural view of the data is to have each unique date-fruit combination as a separate row, but with separate columns for cost and count. That is easily done by the separate() and spread() functions. separate() (and gather() and spread()) is from tidyr, not dplyr, but those packages are often used together. First, separate() transforms the column column into two columns, fruit and parameter. Then, spread() decomposes or "widens" the value column into two columns, using the parameter column as a key. The result at that point looks like this:

Date    fruit   cost    count
2019-06-04  apple   6685.560    601762
2019-06-04  banana  6685.000    557952
2019-06-04  orange  6685.000    1003681
2019-06-05  apple   17387.307   557952
2019-06-05  banana  6685.000    557952
2019-06-05  orange  6685.000    1003681
2019-06-06  apple   31587.269   1003681
2019-06-06  banana  6685.000    557952
2019-06-06  orange  6685.000    1003681
2019-06-07  apple   7489.127    243348
2019-06-07  banana  6685.000    557952
2019-06-07  orange  6685.000    1003681
2019-06-08  apple   8490.184    273511
2019-06-08  banana  6685.000    557952
2019-06-08  orange  6685.000    1003681
2019-06-09  apple   9372.468    303130
2019-06-09  banana  6685.000    557952
2019-06-09  orange  6685.000    1003681 

Your desired output seems like you want to sum over all dates, but to handle each fruit type separately. That's why we group_by(fruit) before doing summarize().

The final output of the whole code block looks like

 fruit  total_cost  total_count
apple   81011.92    2983384
banana  40110.00    3347712
orange  40110.00    6022086 

This isn't exactly the format you want, but it's a more natural one for the tidyverse.

Discrepancies between your provided data and expected output

The values you expect don't seem to emerge from the sample data you provided. The code you did try gives the "right" numbers, at least as far as I can tell. For example, 81011.92 is the sum of all the apple costs, not 243348 as you say you expect.

Upvotes: 1

Related Questions