Reputation: 93
I have a dataframe in R (p2.df) that has aggregated a range of values into the following (there are many more columns this is just an abridge version):
genre rating cc dd ee
Adventure FAILURE 140393 20865 358806
Adventure SUCCESS 197182 32872 492874
Fiction FAILURE 140043 14833 308602
Fiction SUCCESS 197725 28848 469879
Sci-fi FAILURE 8681 1682 24259
Sci-fi SUCCESS 7439 1647 22661
I want to get the net values of the proportions for each column, which I can get in a spreadsheet but can't in R studio.
The formula in the spreadsheet follows the pattern:
net_cc = (cc(success)/(cc(success)+dd(success)+ee(success)) - (cc(fail)/(cc(fail)+dd(fail)+ee(fail))
What I want to get out in R is this table that I can get from the spreadsheet:
genre net_cc net_dd net_ee
Adventure 0.002801373059 0.005350579467 -0.008151952526
Fiction -0.01825346696 0.009417699223 0.008835767735
Sci-fi -0.01641517271 0.003297091109 0.0131180816
Any ideas how? If it's any use I created the p2.df by summarising a previous table as:
library(dplyr)
p2.df<- s2.df %>% group_by(genre,rating) %>% summarise_all(sum)
Upvotes: 2
Views: 122
Reputation: 4505
My answer is very close to @MKR answer, however, I just wish to point out, that we can make use of decoded rating
(SUCESS = 1
and FAILURE
= -1`) variable to avoid subsetting in the last part:
df %>%
mutate(rating = (rating == "SUCCESS")*2 - 1, denom = rowSums(.[3:5])) %>%
group_by(genre) %>%
summarise_at(vars(cc:ee), funs(sum(rating * . / denom)))
# A tibble: 3 x 4
# genre cc dd ee
# <chr> <dbl> <dbl> <dbl>
# 1 Adventure 0.00280 0.00535 -0.00815
# 2 Fiction -0.0183 0.00942 0.00884
# 3 Sci-fi -0.0164 0.00330 0.0131
Upvotes: 1
Reputation: 20095
It's always better to work on data in long format. But if OP doesnt want to transform data in long format due to any constraint (e.g. number of columns are more which will lead to large number of rows in long format etc) then a solution in using dplyr::summarise_at
can be achieved as:
library(dplyr)
df %>% mutate(rowSum = rowSums(.[,names(df)[3:5]])) %>%
group_by(genre) %>%
summarise_at(vars(names(df)[3:5]),
funs(net = .[rating == "SUCCESS"]/rowSum[rating == "SUCCESS"] -
.[rating == "FAILURE"]/rowSum[rating == "FAILURE"] )) %>%
as.data.frame()
# genre cc_net dd_net ee_net
# 1 Adventure 0.002801373 0.005350579 -0.008151953
# 2 Fiction -0.018253467 0.009417699 0.008835768
# 3 Sci-fi -0.016415173 0.003297091 0.013118082
Data:
df <- read.table(text="
genre rating cc dd ee
Adventure FAILURE 140393 20865 358806
Adventure SUCCESS 197182 32872 492874
Fiction FAILURE 140043 14833 308602
Fiction SUCCESS 197725 28848 469879
Sci-fi FAILURE 8681 1682 24259
Sci-fi SUCCESS 7439 1647 22661",
header = TRUE, stringsAsFactors = FALSE)
Upvotes: 2
Reputation: 47350
using tidyverse
:
library(tidyverse)
df %>% gather(,,3:5) %>%
spread(rating,value) %>%
group_by(genre) %>%
transmute(key,net = SUCCESS/sum(SUCCESS) - FAILURE/sum(FAILURE)) %>%
ungroup %>%
spread(key,net)
# # A tibble: 3 x 4
# genre cc dd ee
# <chr> <dbl> <dbl> <dbl>
# 1 Adventure 0.00280 0.00535 -0.00815
# 2 Fiction -0.0183 0.00942 0.00884
# 3 Sci-fi -0.0164 0.00330 0.0131
Upvotes: 1