Reputation: 51
I initially had an ungrouped dataset which I will put below. I grouped the data but now want to use the values for number of yes and no's for each category to find a proportion of deaths/(deaths and survived) to create a logistic regression model.
This is what the ungrouped data set looks like:
Occupation Education Age Died
1 household Secondary 39 no
2 farming primary 83 yes
3 farming primary 60 yes
4 farming primary 73 yes
5 farming Secondary 51 no
6 farming iliterate 62 yes
then grouping the data (on occupation) with the following:
occu %>% group_by(Occupation, Died) %>% count()
which displays:
Occupation Died n
<fct> <fct> <int>
1 business/service no 12
2 business/service yes 9
3 farming no 939
4 farming yes 1093
5 household no 154
6 household yes 94
7 others yes 3
8 others/unknown no 146
9 others/unknown yes 10
so now I would like to count the number of yes's and no in order to obtain a proportion on which to run a logistic regression model for the now grouped data
Upvotes: 1
Views: 62
Reputation: 2764
Using dplyr
dt %>%
group_by(Occupation, Died) %>%
count() %>%
ungroup() %>%
group_by(Occupation) %>%
mutate(counts=sum(n),prop=n/sum(n))
You could also use data.table
library(data.table)
dt <- setDT(dt)[,.(.N),by=.(Died,Occupation)]
setDT(dt)[, `:=` (prop = N/sum(N)*100,total=sum(N)), by = .(Occupation)]
Sample Input Data:
dt <- read.table(text="Occupation Education Age Died
household Secondary 39 no
farming primary 83 yes
farming primary 60 yes
farming primary 73 yes
farming Secondary 51 no
farming iliterate 62 yes",header=T)
Note: I haven't tested it on your complete data, but I have used above input data.
Upvotes: 2
Reputation: 2950
Welcome to the community! Good question! It would be easier for me to help if you had also included the output of dput(occu)
, that way I could use your data, but you put enough for us to help. Anyway, It looks like a good option for tidyr::spread
:
occu_temp <- occu %>% group_by(Occupation, Died) %>% count() %>% spread(Died, n)
This will get us into columns. Then we could get the total, and then, finally, the ratio:
occu_temp %>% mutate(total = yes + no) %>% mutate(ProportionYes = yes/total)
and that should work.
Upvotes: 1
Reputation: 1114
Using dplyr
...
dt <- read.table(text="Occupation Died n
business/service no 12
business/service yes 9
farming no 939
farming yes 1093
household no 154
household yes 94
others yes 3
others/unknown no 146
others/unknown yes 10",header=T)
dt %>% group_by(Occupation) %>%
mutate(total=sum(n), prop=n/total)
# A tibble: 9 x 5
# Groups: Occupation [5]
Occupation Died n total prop
<fct> <fct> <int> <int> <dbl>
1 business/service no 12 21 0.571
2 business/service yes 9 21 0.429
3 farming no 939 2032 0.462
4 farming yes 1093 2032 0.538
5 household no 154 248 0.621
6 household yes 94 248 0.379
7 others yes 3 3 1
8 others/unknown no 146 156 0.936
9 others/unknown yes 10 156 0.0641
Upvotes: 1