Reputation: 37
I'm working with a large data frame of hospitalization records. Many patients have two or more hospitalizations, and their past medical history may be incomplete at one or more of the hospitalizations. I'd like to collapse all the information from each of their hospitalizations into a single list of medical problems for each patient.
Here's a sample data frame:
id <- c("123","456","789","101","123","587","456","789")
HTN <- c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE)
DM2 <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
TIA <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
df <- data.frame(id,HTN,DM2,TIA)
df
Which comes out to:
> df
id HTN DM2 TIA
1 123 TRUE FALSE TRUE
2 456 FALSE FALSE TRUE
3 789 FALSE TRUE TRUE
4 101 FALSE TRUE TRUE
5 123 FALSE FALSE FALSE
6 587 TRUE TRUE TRUE
7 456 FALSE FALSE TRUE
8 789 FALSE TRUE TRUE
I'd like my output to look like this:
id <- c("101","123","456","587","789")
HTN <- c(FALSE,TRUE,FALSE,TRUE,FALSE)
DM2 <- c(TRUE,FALSE,FALSE,TRUE,TRUE)
TIA <- c(TRUE,TRUE,TRUE,TRUE,TRUE)
df2 <- data.frame(id,HTN,DM2,TIA)
df2
id HTN DM2 TIA
1 101 FALSE TRUE TRUE
2 123 TRUE FALSE TRUE
3 456 FALSE FALSE TRUE
4 587 TRUE TRUE TRUE
5 789 FALSE TRUE TRUE
So far I've got a pretty good hunch that arranging and grouping my data is the right place to start, and I think I could make it work by creating a new variable for each medical problem. I have about 30 medical problem's I'll need to collapse in this way, though, and that much repetitive code just seems like a recipe for an occult error.
df3 <- df %>%
arrange(id) %>%
group_by(id)
Looking around I haven't been able to find a particularly elegant way to go about this. Is there some slick dplyr function I'm overlooking?
Upvotes: 2
Views: 613
Reputation: 33753
If we prefer data.table
we might use:
setDT(df)[, lapply(.SD, any), keyby = id]
id HTN DM2 TIA
1: 101 FALSE TRUE TRUE
2: 123 TRUE FALSE TRUE
3: 456 FALSE FALSE TRUE
4: 587 TRUE TRUE TRUE
5: 789 FALSE TRUE TRUE
Upvotes: 0
Reputation: 887991
A base R option would be
aggregate(.~ id, df, any)
# id HTN DM2 TIA
#1 101 FALSE TRUE TRUE
#2 123 TRUE FALSE TRUE
#3 456 FALSE FALSE TRUE
#4 587 TRUE TRUE TRUE
#5 789 FALSE TRUE TRUE
Or with rowsum
rowsum(+(df[-1]), group = df$id) > 0
Upvotes: 1
Reputation: 48251
We may use
df %>% group_by(id) %>% summarize_all(any)
# A tibble: 5 x 4
# id HTN DM2 TIA
# <fct> <lgl> <lgl> <lgl>
# 1 101 FALSE TRUE TRUE
# 2 123 TRUE FALSE TRUE
# 3 456 FALSE FALSE TRUE
# 4 587 TRUE TRUE TRUE
# 5 789 FALSE TRUE TRUE
In this way we first indeed group by id
, as you suggested. Then we summarize
all
the variables with a function any
: we provide a logical vector (e.g., HTN
for patient 101
) and return TRUE
if in any of the rows we have TRUE
and FALSE
otherwise.
Upvotes: 2