rsteward
rsteward

Reputation: 51

Accessing and using values made while grouping data

I initially had an ungrouped dataset which I will put below. I grouped the data but now want to use the values for number of yes and no's for each category to find a proportion of deaths/(deaths and survived) to create a logistic regression model.

This is what the ungrouped data set looks like:

 Occupation Education Age Died
1  household Secondary  39   no
2    farming   primary  83  yes
3    farming   primary  60  yes
4    farming   primary  73  yes
5    farming Secondary  51   no
6    farming iliterate  62  yes

then grouping the data (on occupation) with the following:

occu %>% group_by(Occupation, Died) %>% count()

which displays:

Occupation       Died      n
   <fct>            <fct> <int>
 1 business/service no       12
 2 business/service yes       9
 3 farming          no      939
 4 farming          yes    1093
 5 household        no      154
 6 household        yes      94
 7 others           yes       3
 8 others/unknown   no      146
 9 others/unknown   yes      10

so now I would like to count the number of yes's and no in order to obtain a proportion on which to run a logistic regression model for the now grouped data

Upvotes: 1

Views: 62

Answers (3)

Rushabh Patel
Rushabh Patel

Reputation: 2764

Using dplyr

dt %>% 
  group_by(Occupation, Died) %>% 
  count() %>% 
  ungroup() %>%
  group_by(Occupation) %>% 
  mutate(counts=sum(n),prop=n/sum(n))

You could also use data.table

library(data.table)
dt <- setDT(dt)[,.(.N),by=.(Died,Occupation)]
setDT(dt)[, `:=` (prop = N/sum(N)*100,total=sum(N)), by = .(Occupation)]

Sample Input Data:

dt <- read.table(text="Occupation Education Age Died
  household Secondary  39   no
    farming   primary  83  yes
    farming   primary  60  yes
    farming   primary  73  yes
    farming Secondary  51   no
    farming iliterate  62  yes",header=T)

Note: I haven't tested it on your complete data, but I have used above input data.

Upvotes: 2

Amit Kohli
Amit Kohli

Reputation: 2950

Welcome to the community! Good question! It would be easier for me to help if you had also included the output of dput(occu), that way I could use your data, but you put enough for us to help. Anyway, It looks like a good option for tidyr::spread:

occu_temp <- occu %>% group_by(Occupation, Died) %>% count() %>% spread(Died, n)

This will get us into columns. Then we could get the total, and then, finally, the ratio:

occu_temp %>% mutate(total = yes + no) %>% mutate(ProportionYes = yes/total)

and that should work.

Upvotes: 1

kstew
kstew

Reputation: 1114

Using dplyr...

dt <- read.table(text="Occupation       Died      n

  business/service no       12
  business/service yes       9
  farming          no      939
  farming          yes    1093
  household        no      154
  household        yes      94
  others           yes       3
  others/unknown   no      146
  others/unknown   yes      10",header=T)

dt %>% group_by(Occupation) %>% 
  mutate(total=sum(n), prop=n/total)

# A tibble: 9 x 5
# Groups:   Occupation [5]
  Occupation       Died      n total   prop
  <fct>            <fct> <int> <int>  <dbl>
1 business/service no       12    21 0.571 
2 business/service yes       9    21 0.429 
3 farming          no      939  2032 0.462 
4 farming          yes    1093  2032 0.538 
5 household        no      154   248 0.621 
6 household        yes      94   248 0.379 
7 others           yes       3     3 1     
8 others/unknown   no      146   156 0.936 
9 others/unknown   yes      10   156 0.0641

Upvotes: 1

Related Questions