JRUK
JRUK

Reputation: 93

Getting net values as a proportion from a dataframe in R

I have a dataframe in R (p2.df) that has aggregated a range of values into the following (there are many more columns this is just an abridge version):

genre       rating  cc      dd      ee
Adventure   FAILURE 140393  20865   358806
Adventure   SUCCESS 197182  32872   492874
Fiction     FAILURE 140043  14833   308602
Fiction     SUCCESS 197725  28848   469879
Sci-fi      FAILURE 8681    1682    24259
Sci-fi      SUCCESS 7439    1647    22661

I want to get the net values of the proportions for each column, which I can get in a spreadsheet but can't in R studio.

The formula in the spreadsheet follows the pattern:

net_cc = (cc(success)/(cc(success)+dd(success)+ee(success)) - (cc(fail)/(cc(fail)+dd(fail)+ee(fail))

What I want to get out in R is this table that I can get from the spreadsheet:

genre       net_cc          net_dd          net_ee
Adventure   0.002801373059  0.005350579467  -0.008151952526
Fiction     -0.01825346696  0.009417699223  0.008835767735
Sci-fi      -0.01641517271  0.003297091109  0.0131180816

Any ideas how? If it's any use I created the p2.df by summarising a previous table as:

library(dplyr)

p2.df<- s2.df %>% group_by(genre,rating) %>% summarise_all(sum)  

Upvotes: 2

Views: 122

Answers (3)

utubun
utubun

Reputation: 4505

My answer is very close to @MKR answer, however, I just wish to point out, that we can make use of decoded rating (SUCESS = 1 and FAILURE = -1`) variable to avoid subsetting in the last part:

df %>% 
  mutate(rating = (rating == "SUCCESS")*2 - 1, denom = rowSums(.[3:5])) %>%
  group_by(genre) %>%
  summarise_at(vars(cc:ee), funs(sum(rating * . / denom)))

   #   A tibble: 3 x 4
   #   genre           cc      dd       ee
   #   <chr>        <dbl>   <dbl>    <dbl>
   # 1 Adventure  0.00280 0.00535 -0.00815
   # 2 Fiction   -0.0183  0.00942  0.00884
   # 3 Sci-fi    -0.0164  0.00330  0.0131 

Upvotes: 1

MKR
MKR

Reputation: 20095

It's always better to work on data in long format. But if OP doesnt want to transform data in long format due to any constraint (e.g. number of columns are more which will lead to large number of rows in long format etc) then a solution in using dplyr::summarise_at can be achieved as:

library(dplyr)

df %>% mutate(rowSum = rowSums(.[,names(df)[3:5]])) %>%
  group_by(genre) %>%
  summarise_at(vars(names(df)[3:5]),
              funs(net = .[rating == "SUCCESS"]/rowSum[rating == "SUCCESS"] - 
                         .[rating == "FAILURE"]/rowSum[rating == "FAILURE"] )) %>%
  as.data.frame()

#       genre       cc_net      dd_net       ee_net
# 1 Adventure  0.002801373 0.005350579 -0.008151953
# 2   Fiction -0.018253467 0.009417699  0.008835768
# 3    Sci-fi -0.016415173 0.003297091  0.013118082

Data:

df <- read.table(text="
genre       rating  cc      dd      ee
Adventure   FAILURE 140393  20865   358806
Adventure   SUCCESS 197182  32872   492874
Fiction     FAILURE 140043  14833   308602
Fiction     SUCCESS 197725  28848   469879
Sci-fi      FAILURE 8681    1682    24259
Sci-fi      SUCCESS 7439    1647    22661",
header = TRUE, stringsAsFactors = FALSE)

Upvotes: 2

moodymudskipper
moodymudskipper

Reputation: 47350

using tidyverse:

library(tidyverse)
df %>% gather(,,3:5) %>%
  spread(rating,value) %>%
  group_by(genre) %>%
  transmute(key,net = SUCCESS/sum(SUCCESS) - FAILURE/sum(FAILURE)) %>%
  ungroup %>%
  spread(key,net)

# # A tibble: 3 x 4
# genre           cc      dd       ee
#   <chr>        <dbl>   <dbl>    <dbl>
# 1 Adventure  0.00280 0.00535 -0.00815
# 2 Fiction   -0.0183  0.00942  0.00884
# 3 Sci-fi    -0.0164  0.00330  0.0131 

Upvotes: 1

Related Questions