Reputation: 127
I found that dplyr is speedy and simple for aggregate and summarise data. But I can't find out how to solve the following problem with dplyr.
Given these data frames:
df_2017 <- data.frame(
expand.grid(1:195,1:65,1:39),
value = sample(1:1000000,(195*65*39)),
period = rep("2017",(195*65*39)),
stringsAsFactors = F
)
df_2017 <- df_2017[sample(1:(195*65*39),450000),]
names(df_2017) <- c("company", "product", "acc_concept", "value", "period")
df_2017$company <- as.character(df_2017$company)
df_2017$product <- as.character(df_2017$product)
df_2017$acc_concept <- as.character(df_2017$acc_concept)
df_2017$value <- as.numeric(df_2017$value)
ratio_df <- data.frame(concept=c("numerator","numerator","numerator","denom", "denom", "denom","name"),
ratio1=c("1","","","4","","","Sales over Assets"),
ratio2=c("1","","","5","6","","Sales over Expenses A + B"), stringsAsFactors = F)
where the columns in df_2017 are:
As the expand.grid implies, the combinations of company - product - acc_concept are never duplicated, but, It could happen that certains subjects have not every company - product - acc_concept combinations. That's why the code line "df_2017 <- df_2017[sample(1:195*65*39),450000),]", and that's why the output could turn out into NA (see below).
And where the columns in ratio_df are:
I want to calculate 2 ratios (ratio_df) between acc_concept, for each product within each company.
For example:
I take the first ratio "acc_concepts" and "name" from ratio_df:
num_acc_concept <- ratio_df[ratio_df$concept == "numerator", 2]
denom_acc_concept <- ratio_df[ratio_df$concept == "denom", 2]
ratio_name <- ratio_df[ratio_df$concept == "name", 2]
Then I calculate the ratio for one product of one company, just to show you want i want to do:
ratio1_value <- sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% num_acc_concept, 4]) / sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% denom_acc_concept, 4])
Output:
output <- data.frame(Company="1", Product="1", desc_ratio=ratio_name, ratio_value = ratio1_value, stringsAsFactors = F)
As i said before i want to do this for each product within each company
The output data.frame could be something like this (ratios aren't the true ones because i haven't done the calculations yet):
company product desc_ratio ratio_value
1 1 Sales over Assets 0.9303675
1 2 Sales over Assets 1.30
1 3 Sales over Assets Nan
1 4 Sales over Assets Inf
1 5 Sales over Assets 2.32
1 6 Sales over Assets NA
.
.
.
1 1 Sales over Expenses A + B 3.25
.
.
.
2 1 Sales over Assets 0.256
and so on...
I hope i have made myself clear this time :)
Is there any way to solve this row problem with dplyr? Should I cast the df_2017 for mutating? In this case, which is the best way for casting?
Any help would be welcome!
Upvotes: 0
Views: 281
Reputation: 23608
This is one way of doing it. At the end I timed the code on all of your records.
First create a function to create all the ratios. Do note, this function is only useful inside the dplyr code.
ratio <- function(data){
result <- data.frame(desc_ratio = rep(NA, ncol(ratio_df) -1), ratio_value = rep(NA, ncol(ratio_df) -1))
for(i in 2:ncol(ratio_df)){
num <- ratio_df[ratio_df$concept == "numerator", i]
denom <- ratio_df[ratio_df$concept == "denom", i]
result$desc_ratio[i-1] <- ratio_df[ratio_df$concept == "name", i]
result$ratio_value[i-1] <- sum(ifelse(data$acc_concept %in% num, data$value, 0)) / sum(ifelse(data$acc_concept %in% denom, data$value, 0))
}
return(result)
}
Using dplyr, tidyr and purrr to put everything together. First group by the data, nest the data needed for the function, run the function with a mutate on the nested data. Drop the not needed nested data and unnest to get your wanted output. I leave the sorting up to you.
library(dplyr)
library(purrr)
library(tidyr)
output <- df_2017 %>%
group_by(company, product, period) %>%
nest() %>%
mutate(ratios = map(data, ratio)) %>%
select(-data) %>%
unnest
output
# A tibble: 25,350 x 5
company product period desc_ratio ratio_value
<chr> <chr> <chr> <chr> <dbl>
1 103 2 2017 Sales over Assets 0.733
2 103 2 2017 Sales over Expenses A + B 0.219
3 26 26 2017 Sales over Assets 0.954
4 26 26 2017 Sales over Expenses A + B 1.01
5 85 59 2017 Sales over Assets 4.14
6 85 59 2017 Sales over Expenses A + B 1.83
7 186 38 2017 Sales over Assets 7.85
8 186 38 2017 Sales over Expenses A + B 0.722
9 51 25 2017 Sales over Assets 2.34
10 51 25 2017 Sales over Expenses A + B 0.627
# ... with 25,340 more rows
Time it took to run this code on my machine measured with system.time
:
user system elapsed
6.75 0.00 6.81
Upvotes: 1