Reputation: 71
I have a dataset that describes a sample of people and the number and type of diseases they have. Here, 1 means that the person has the disease and 0 means that the person does not have the disease. NA denotes missing values. It looks something like this:
library(tidyverse)
df <- tribble(
~Heart_disease, ~Lung_disease, ~Bowel_disease, ~Nerve_disease, ~Liver_disease
, 0, 1, 0, 1, 0
, NA, 0, 0, 0, 0
, 1, 1, 1, 1, 0
, 0, 1, 0, 0, 1
, 1, 0, 0, 1, 0
, 0, 0, 1, NA, NA
, 1, 0, 0, 0, 0
, 0, 0, 1, 0, 1
, 0, 0, 0, 0, 0
, 0, 1, 1, 1, 1
)
   Heart_disease Lung_disease Bowel_disease Nerve_disease Liver_disease
           <dbl>        <dbl>         <dbl>         <dbl>         <dbl>
 1             0            1             0             1             0
 2            NA            0             0             0             0
 3             1            1             1             1             0
 4             0            1             0             0             1
 5             1            0             0             1             0
 6             0            0             1            NA            NA
 7             1            0             0             0             0
 8             0            0             1             0             1
 9             0            0             0             0             0
10             0            1             1             1             1
I would like to know: a) How many people have two diseases? b) How many people have three or more diseases?
How could I calculate this using R?
Many thanks for your help
Upvotes: 0
Views: 119
Reputation: 23574
Here is one way. I assume each row number (row name) represents a person. You can get the number of diseases per person by summing each row with rowSums(). Once you have that, you can aggregate the data: I counted how many rows have a total of exactly 2, and did the same for the other condition (a total of 3 or more).
library(dplyr)
mutate(mydf, total = rowSums(mydf, na.rm = TRUE)) %>%
  summarize(two = sum(total == 2), morethan3 = sum(total >= 3))
# two morethan3
#1 4 2
DATA
mydf <- structure(list(Heart_disease = c(0L, NA, 1L, 0L, 1L, 0L, 1L,
0L, 0L, 0L), Lung_disease = c(1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L,
0L, 1L), Bowel_disease = c(0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L,
1L), Nerve_disease = c(1L, 0L, 1L, 0L, 1L, NA, 0L, 0L, 0L, 1L
), Liver_disease = c(0L, 0L, 0L, 1L, 0L, NA, 0L, 1L, 0L, 1L)), class =
"data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
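The same idea also works without dplyr. Below is a minimal base R sketch, assuming the data from the question; the names total, has_missing, counts and n_incomplete are only illustrative. Note that na.rm = TRUE effectively treats a missing value as "does not have the disease", so it may be worth checking how many people have incomplete records.

```r
# Same data as in the question, built as a plain data frame.
df <- data.frame(
  Heart_disease = c(0, NA, 1, 0, 1, 0, 1, 0, 0, 0),
  Lung_disease  = c(1, 0, 1, 1, 0, 0, 0, 0, 0, 1),
  Bowel_disease = c(0, 0, 1, 0, 0, 1, 0, 1, 0, 1),
  Nerve_disease = c(1, 0, 1, 0, 1, NA, 0, 0, 0, 1),
  Liver_disease = c(0, 0, 0, 1, 0, NA, 0, 1, 0, 1)
)

# Number of diseases per person; NA is ignored, i.e. counted as 0.
total <- rowSums(df, na.rm = TRUE)

# Flag people with at least one missing value, since their total
# is only a lower bound on the true number of diseases.
has_missing <- rowSums(is.na(df)) > 0

# a) exactly two diseases, b) three or more diseases
counts <- c(two = sum(total == 2), three_or_more = sum(total >= 3))
n_incomplete <- sum(has_missing)

print(counts)
print(n_incomplete)
```

If the people with missing values should not be counted at all, one option is to subset with total[!has_missing] before counting.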
Upvotes: 1
Reputation: 8127
So, this is the dplyr / tidyverse solution:
library(tidyverse)
df <- tribble(
~Heart_disease, ~Lung_disease, ~Bowel_disease, ~Nerve_disease, ~Liver_disease
, 0, 1, 0, 1, 0
, NA, 0, 0, 0, 0
, 1, 1, 1, 1, 0
, 0, 1, 0, 0, 1
, 1, 0, 0, 1, 0
, 0, 0, 1, NA, NA
, 1, 0, 0, 0, 0
, 0, 0, 1, 0, 1
, 0, 0, 0, 0, 0
, 0, 1, 1, 1, 1
)
df %>%
  mutate(patientID = 1:nrow(.)) %>%
  gather("disease", "occured", -patientID) %>%
  group_by(patientID) %>%
  summarise(nrDiseases = sum(occured, na.rm = TRUE)) %>%
  arrange(nrDiseases) %>%
  group_by(nrDiseases) %>%
  summarise(howManyPeople = n())
  nrDiseases howManyPeople
       <dbl>         <int>
1          0             2
2          1             2
3          2             4
4          4             2
In case it is unclear how this works: %>% is to be read as "then". Try running only parts of the code to see the intermediate results, e.g. this part
df %>%
  mutate(patientID = 1:nrow(.)) %>%
  gather("disease", "occured", -patientID) %>%
  group_by(patientID) %>%
  summarise(nrDiseases = sum(occured, na.rm = TRUE))
will give you this
   patientID nrDiseases
       <int>      <dbl>
 1         1          2
 2         2          0
 3         3          4
 4         4          2
 5         5          2
 6         6          1
 7         7          1
 8         8          2
 9         9          0
10        10          4
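In current versions of tidyr, gather() is superseded by pivot_longer(). Here is a sketch of the same pipeline written that way, ending with the two numbers asked for in the question. It assumes tidyr 1.0.0 or later; the names disease_counts, answer and three_or_more are only illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- tribble(
  ~Heart_disease, ~Lung_disease, ~Bowel_disease, ~Nerve_disease, ~Liver_disease,
  0, 1, 0, 1, 0,
  NA, 0, 0, 0, 0,
  1, 1, 1, 1, 0,
  0, 1, 0, 0, 1,
  1, 0, 0, 1, 0,
  0, 0, 1, NA, NA,
  1, 0, 0, 0, 0,
  0, 0, 1, 0, 1,
  0, 0, 0, 0, 0,
  0, 1, 1, 1, 1
)

# One row per patient with the number of diseases (NA counted as 0).
disease_counts <- df %>%
  mutate(patientID = row_number()) %>%
  pivot_longer(-patientID, names_to = "disease", values_to = "occurred") %>%
  group_by(patientID) %>%
  summarise(nrDiseases = sum(occurred, na.rm = TRUE))

# a) exactly two diseases, b) three or more diseases
answer <- disease_counts %>%
  summarise(two = sum(nrDiseases == 2), three_or_more = sum(nrDiseases >= 3))

print(answer)
```

The long format is more than is strictly needed for this question, but it makes follow-up questions (e.g. counts per disease) easy to answer with the same data.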
Upvotes: 0