Reputation: 13
I have a dataset that I basically need to roll-up and self merge.
There was an old code which used SQLDF which is highly inefficient (slow). So, I have put in dplyr summarise()
which the fastest roll-up I have seen in my life, this is better than SAS.
My dataset (data_df) has 3 identifying columns (Key
, Prod
, Subc
) and one numeric value field (Cash
) that needs to be summed up at various unique combinations of the 3 columns. Since this is a large dataset, to reduce my RAM usage to a minimum, I am trying to summarise at all 3 level combinations and keep the rolled up data in the same dataset.
Key Prod Subc Cash
K1 P1 S1 10
K2 P2 S3 30
K1 P1 S2 10
K3 P4 S4 40
Now, I'd like to add 3 new roll-up columns (Cash_K, Cash_KP, Cash_KS, Cash_KSP) in the same dataset.
Key Prod Subc Cash Cash_K Cash_KP Cash_KS Cash_KSP
K1 P1 S1 10 20 20 10 10
K2 P2 S3 30 30 30 30 30
K1 P1 S2 10 20 20 10 10
K3 P4 S4 40 40 40 40 40
Code I'm using right now makes 4 datasets:
KPS Rollup
data_df_1 <- summarise(select(group_by(data_df,Key, Subc, Prod), Cash), Cash_KSP = sum(Cash, na.rm = TRUE))
rm(data_df)
KS Rollup
data_df_2 <- summarise(select(group_by(data_df,Key, Subc), Cash_KSP), Cash_KS = sum(Cash_KSP, na.rm = TRUE))
K Rollup
data_df_3 <- summarise(select(group_by(data_df,Keyword), Cash_KS), Cash_K = sum(Cash_KS, na.rm = TRUE))
KP Rollup
data_df_4 <- summarise(select(group_by(data_df,Keyword,Product),Cash_KSP), Cash_KP = sum(Cash_KSP, na.rm = T))
For the people who prefer %>% notation:
KPS Rollup
data_df %>% group_by(Key, Subc, Prod) %>% summarise(Cash_KSP = sum(Cash, na.rm = TRUE)) %>% select (Key, Subc, Prod, Cash_KSP) etc.
So, I need "proper" roll-up only at KSP level. Other rollups would be basically a repetition of roll-up at every unique combination.
I wrote something akin to this:
KPS Rollup
data_1 <- summarise(select(group_by(data_df,Key, Subc, Prod), Cash), Cash_KSP = sum(Cash, na.rm = TRUE))
rm(data)
KS Rollup
data_2 <- summarise(select(group_by(data_1,Key, Subc),Prod, Cash_KSP), Cash_KS = sum(Cash_KSP, na.rm = TRUE))
K Rollup
data_2 <- summarise(select(group_by(data_2,Key),Subc,Prod, Cash_KS), Cash_K = sum(Cash_KS, na.rm = TRUE))
KP Rollup
data_2 <- summarise(select(group_by(data_2,Key,Prod),Subc, Cash_KSP), Cash_KP = sum(Cash_KSP, na.rm = T))
But the code fails at K Rollup (Step 3) as code is not able to keep 'Prod' column after the 2nd rollup even though it is mentioned in the select statement.
Can dplyr or any other method do it without requiring any merges? Are there any reservations whilst using dplyr?
EDIT:
Ranking on all four new columns, how do I do it? Mutate(rank(), dense_rank()) doesn't seem to work. No sorting happens, all rows get a rank = 1, any group_by combination doesn't correct it either.
data <- data[order(-data$Cash_K),]
data <- group_by(data, Key, Subc, Prod) %>% mutate(Rank_K=rank(-data$Cash_K, ties.method = 'first'))
Upvotes: 0
Views: 1355
Reputation: 83215
Instead of using summarise
, it is better to use mutate
for this:
data_df <- read.table(text="Key Prod Subc Cash
K1 P1 S1 10
K2 P2 S3 30
K1 P1 S2 10
K3 P4 S4 40", header=TRUE)
library(dplyr)
data_df <- data_df %>% group_by(Key) %>% mutate(Cash_K=sum(Cash)) %>%
group_by(Key,Prod) %>% mutate(Cash_KP=sum(Cash)) %>%
group_by(Key,Subc) %>% mutate(Cash_KS=sum(Cash)) %>%
group_by(Key,Subc,Prod) %>% mutate(Cash_KSP=sum(Cash))
which results in the following data_df
:
> data_df
Key Prod Subc Cash Cash_K Cash_KP Cash_KS Cash_KSP
1 K1 P1 S1 10 20 20 10 10
2 K2 P2 S3 30 30 30 30 30
3 K1 P1 S2 10 20 20 10 10
4 K3 P4 S4 40 40 40 40 40
When you only need the Cash_KSP
variable:
data_df <- data_df %>% group_by(Key,Subc,Prod) %>% mutate(Cash_KSP=sum(Cash))
You can arrange your dataframe with for example:
data_df <- data_df %>% arrange(Key)
which results in:
> data_df
Key Prod Subc Cash Cash_K Cash_KP Cash_KS Cash_KSP
1 K1 P1 S1 10 20 20 10 10
2 K1 P1 S2 10 20 20 10 10
3 K2 P2 S3 30 30 30 30 30
4 K3 P4 S4 40 40 40 40 40
Upvotes: 3