Kavindra M.
Kavindra M.

Reputation: 13

Summarize without reducing rows

I have a dataset that I basically need to roll-up and self merge.

There was an old code which used SQLDF which is highly inefficient (slow). So, I have put in dplyr summarise() which the fastest roll-up I have seen in my life, this is better than SAS.

My dataset (data_df) has 3 identifying columns (Key, Prod, Subc) and one numeric value field (Cash) that needs to be summed up at various unique combinations of the 3 columns. Since this is a large dataset, to reduce my RAM usage to a minimum, I am trying to summarise at all 3 level combinations and keep the rolled up data in the same dataset.

Key Prod Subc Cash
K1  P1   S1   10
K2  P2   S3   30
K1  P1   S2   10
K3  P4   S4   40

Now, I'd like to add 3 new roll-up columns (Cash_K, Cash_KP, Cash_KS, Cash_KSP) in the same dataset.

Key Prod Subc Cash Cash_K Cash_KP Cash_KS Cash_KSP
K1  P1   S1   10   20     20      10      10 
K2  P2   S3   30   30     30      30      30
K1  P1   S2   10   20     20      10      10
K3  P4   S4   40   40     40      40      40

Code I'm using right now makes 4 datasets:

KPS Rollup  
data_df_1 <- summarise(select(group_by(data_df,Key, Subc, Prod), Cash), Cash_KSP = sum(Cash, na.rm = TRUE))

rm(data_df)

KS Rollup
data_df_2 <- summarise(select(group_by(data_df,Key, Subc), Cash_KSP), Cash_KS = sum(Cash_KSP, na.rm = TRUE))

K Rollup
data_df_3 <- summarise(select(group_by(data_df,Keyword), Cash_KS), Cash_K = sum(Cash_KS, na.rm = TRUE))

KP Rollup
data_df_4 <- summarise(select(group_by(data_df,Keyword,Product),Cash_KSP), Cash_KP = sum(Cash_KSP, na.rm = T))

For the people who prefer %>% notation:

KPS Rollup 
data_df %>% group_by(Key, Subc, Prod) %>% summarise(Cash_KSP = sum(Cash, na.rm = TRUE)) %>% select (Key, Subc, Prod, Cash_KSP) etc.

So, I need "proper" roll-up only at KSP level. Other rollups would be basically a repetition of roll-up at every unique combination.

I wrote something akin to this:

KPS Rollup  
data_1 <- summarise(select(group_by(data_df,Key, Subc, Prod), Cash), Cash_KSP = sum(Cash, na.rm = TRUE))

rm(data)

KS Rollup
data_2 <- summarise(select(group_by(data_1,Key, Subc),Prod, Cash_KSP), Cash_KS = sum(Cash_KSP, na.rm = TRUE))

K Rollup
data_2 <- summarise(select(group_by(data_2,Key),Subc,Prod, Cash_KS), Cash_K = sum(Cash_KS, na.rm = TRUE))

KP Rollup
data_2 <- summarise(select(group_by(data_2,Key,Prod),Subc, Cash_KSP), Cash_KP = sum(Cash_KSP, na.rm = T))

But the code fails at K Rollup (Step 3) as code is not able to keep 'Prod' column after the 2nd rollup even though it is mentioned in the select statement.

Can dplyr or any other method do it without requiring any merges? Are there any reservations whilst using dplyr?

EDIT:

Ranking on all four new columns, how do I do it? Mutate(rank(), dense_rank()) doesn't seem to work. No sorting happens, all rows get a rank = 1, any group_by combination doesn't correct it either.

data <- data[order(-data$Cash_K),] 
data <- group_by(data, Key, Subc, Prod) %>% mutate(Rank_K=rank(-data$Cash_K, ties.method = 'first')) 

Upvotes: 0

Views: 1355

Answers (1)

Jaap
Jaap

Reputation: 83215

Instead of using summarise, it is better to use mutate for this:

data_df <- read.table(text="Key Prod Subc Cash
K1  P1   S1   10
K2  P2   S3   30
K1  P1   S2   10
K3  P4   S4   40", header=TRUE)

library(dplyr)

data_df <- data_df %>% group_by(Key) %>% mutate(Cash_K=sum(Cash)) %>%
  group_by(Key,Prod) %>% mutate(Cash_KP=sum(Cash)) %>%
  group_by(Key,Subc) %>% mutate(Cash_KS=sum(Cash)) %>%
  group_by(Key,Subc,Prod) %>% mutate(Cash_KSP=sum(Cash))

which results in the following data_df:

> data_df

  Key Prod Subc Cash Cash_K Cash_KP Cash_KS Cash_KSP
1  K1   P1   S1   10     20      20      10       10
2  K2   P2   S3   30     30      30      30       30
3  K1   P1   S2   10     20      20      10       10
4  K3   P4   S4   40     40      40      40       40

When you only need the Cash_KSP variable:

data_df <- data_df %>% group_by(Key,Subc,Prod) %>% mutate(Cash_KSP=sum(Cash))

You can arrange your dataframe with for example:

data_df <- data_df %>% arrange(Key)

which results in:

> data_df

  Key Prod Subc Cash Cash_K Cash_KP Cash_KS Cash_KSP
1  K1   P1   S1   10     20      20      10       10
2  K1   P1   S2   10     20      20      10       10
3  K2   P2   S3   30     30      30      30       30
4  K3   P4   S4   40     40      40      40       40

Upvotes: 3

Related Questions