brianerly
brianerly

Reputation: 37

R - Collapse Data by Grouped Row Observations

I'm working with a large data frame of hospitalization records. Many patients have two or more hospitalizations, and their past medical history may be incomplete at one or more of the hospitalizations. I'd like to collapse all the information from each of their hospitalizations into a single list of medical problems for each patient.

Here's a sample data frame:

 id <- c("123","456","789","101","123","587","456","789")
 HTN <- c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, 
 FALSE)
 DM2 <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
 TIA <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
 df <- data.frame(id,HTN,DM2,TIA)
 df

Which comes out to:

 > df
    id   HTN   DM2   TIA
 1 123  TRUE FALSE  TRUE
 2 456 FALSE FALSE  TRUE
 3 789 FALSE  TRUE  TRUE
 4 101 FALSE  TRUE  TRUE
 5 123 FALSE FALSE FALSE
 6 587  TRUE  TRUE  TRUE
 7 456 FALSE FALSE  TRUE
 8 789 FALSE  TRUE  TRUE

I'd like my output to look like this:

  id <- c("101","123","456","587","789")
  HTN <- c(FALSE,TRUE,FALSE,TRUE,FALSE)
  DM2 <- c(TRUE,FALSE,FALSE,TRUE,TRUE)
  TIA <- c(TRUE,TRUE,TRUE,TRUE,TRUE)
  df2 <- data.frame(id,HTN,DM2,TIA)
  df2

    id  HTN  DM2 TIA
 1 101 FALSE  TRUE TRUE
 2 123  TRUE FALSE TRUE
 3 456 FALSE FALSE TRUE
 4 587  TRUE  TRUE TRUE
 5 789 FALSE  TRUE TRUE

So far I've got a pretty good hunch that arranging and grouping my data is the right place to start, and I think I could make it work by creating a new variable for each medical problem. I have about 30 medical problem's I'll need to collapse in this way, though, and that much repetitive code just seems like a recipe for an occult error.

 df3 <- df %>%
   arrange(id) %>%
   group_by(id)

Looking around I haven't been able to find a particularly elegant way to go about this. Is there some slick dplyr function I'm overlooking?

Upvotes: 2

Views: 613

Answers (3)

s_baldur
s_baldur

Reputation: 33753

If we prefer data.table we might use:

setDT(df)[, lapply(.SD, any), keyby = id]
    id   HTN   DM2  TIA
1: 101 FALSE  TRUE TRUE
2: 123  TRUE FALSE TRUE
3: 456 FALSE FALSE TRUE
4: 587  TRUE  TRUE TRUE
5: 789 FALSE  TRUE TRUE

Upvotes: 0

akrun
akrun

Reputation: 887991

A base R option would be

aggregate(.~ id, df, any)
#   id   HTN   DM2  TIA
#1 101 FALSE  TRUE TRUE
#2 123  TRUE FALSE TRUE
#3 456 FALSE FALSE TRUE
#4 587  TRUE  TRUE TRUE
#5 789 FALSE  TRUE TRUE

Or with rowsum

rowsum(+(df[-1]), group = df$id) > 0

Upvotes: 1

Julius Vainora
Julius Vainora

Reputation: 48251

We may use

df %>% group_by(id) %>% summarize_all(any)
# A tibble: 5 x 4
#   id    HTN   DM2   TIA  
#   <fct> <lgl> <lgl> <lgl>
# 1 101   FALSE TRUE  TRUE 
# 2 123   TRUE  FALSE TRUE 
# 3 456   FALSE FALSE TRUE 
# 4 587   TRUE  TRUE  TRUE 
# 5 789   FALSE TRUE  TRUE

In this way we first indeed group by id, as you suggested. Then we summarize all the variables with a function any: we provide a logical vector (e.g., HTN for patient 101) and return TRUE if in any of the rows we have TRUE and FALSE otherwise.

Upvotes: 2

Related Questions