logjammin
logjammin

Reputation: 1211

In R, conditionally count IDs who meet an ANY condition on a certain attribute

Background

I've got a dataframe df:

> df <- data.frame(ID = c("a","a","a", "b", "c","c","c","c","d","d","d","d"),
                 treatment = c(0,1,1,0,0,0,0,0,0,0,0,1),
                 event = c(0,1,0,0,1,1,1,1,0,0,1,0),
                 stringsAsFactors=FALSE) 
> df
   ID treatment event
1   a         0     0
2   a         1     1
3   a         1     0
4   b         0     0
5   c         0     1
6   c         0     1
7   c         0     1
8   c         0     1
9   d         0     0
10  d         0     0
11  d         0     1
12  d         1     0

It's got 4 people (ID) in it: a, b, c, and d. Each ID can have one record, like b does, or several records, like the other 3 do. They each have values for treatment, which is binary, and event, which is binary too.

The problem

I want to count how many distinct IDs who have ANY treatment==1 or NO treatment==1 ever (in other words, all treatment==0) also have event==1 and event==0. Here's what I'd like the result to look like:

> [some R code]

event  treatment  n
    0          0  1           
    0          1  0
    1          0  1
    1          1  2

To break this out into English:

What I've tried

I know I'm close, but the dplyr code I'm using is getting tripped up by the fact that IDs have rows with both treatment==1 and treatment==0, and the same for event -- so R is double-counting everyone, showing me with 8 people instead of 4:

df %>% 
  group_by(event, treatment) %>%
  distinct(ID) %>%
  count(event) %>%
  as.data.frame() %>%
  mutate(Percent = round((n/sum(n))*100, digits = 1))

  event treatment n Percent
1     0         0 3    37.5
2     0         1 2    25.0
3     1         0 2    25.0
4     1         1 1    12.5

I'm not married to dplyr on this, but it's what I'm most familiar with (as opposed to data.table, for instance). The real dataset this code is going to be run on has several million rows, just FYI.

Upvotes: 0

Views: 527

Answers (2)

Onyambu
Onyambu

Reputation: 79208

In base R:

data.frame(t(table(aggregate(.~ID, df, \(x)+(sum(x)>0))[-1])))

  event treatment Freq
1     0         0    1
2     1         0    1
3     0         1    0
4     1         1    2

Upvotes: 3

Kra.P
Kra.P

Reputation: 15123

You may try

df %>%
  group_by(ID) %>%
  summarize(treatment = as.numeric(sum(treatment) > 0), 
            event = as.numeric(sum(event) > 0)) %>%
  select(-ID) %>%
  count(treatment, event)

  treatment event     n
      <dbl> <dbl> <int>
1         0     0     1
2         0     1     1
3         1     1     2

Upvotes: 2

Related Questions