codemachino
codemachino

Reputation: 103

Constraint on panel data to remove subjects using data.table

I have a panel dataset:

data <- data.table(ID = c(1,1,1,1,2,2,3,3,3),
                   year = c(1,2,3,4,1,2,1,2,3),
                   score1 = c(90,78,92,69,86,73,82,85,91))

> data
   ID year score1
1:  1    1     90
2:  1    2     78
3:  1    3     92
4:  1    4     69
5:  2    1     86
6:  2    2     73
7:  3    1     82
8:  3    2     85
9:  3    3     91

I want to place a constraint such that every ID should have at least 3 years of observations. I tried using the following using the data.table package:

data[data$year >= 3, ]

However this ignores the observations in years 1 & 2 for ID's which have at least 3 years of observations. In other words, I want to only look at ID's which have at least 3 years of observations and also include the year 1 & 2 observations.

The expected output is thus:

   ID year score1
1:  1    1     90
2:  1    2     78
3:  1    3     92
4:  1    4     69
5:  3    1     82
6:  3    2     85
7:  3    3     91

Upvotes: 2

Views: 45

Answers (2)

ThomasIsCoding
ThomasIsCoding

Reputation: 102920

Another data.table option

> data[, .SD[uniqueN(year) >= 3], ID]
   ID year score1
1:  1    1     90
2:  1    2     78
3:  1    3     92
4:  1    4     69
5:  3    1     82
6:  3    2     85
7:  3    3     91

Upvotes: 1

akrun
akrun

Reputation: 887981

We may use a group by approach

library(dplyr)
data %>%
    group_by(ID) %>% 
    filter(n_distinct(year) >= 3) %>%
    ungroup

-output

# A tibble: 7 x 3
     ID  year score1
  <dbl> <dbl>  <dbl>
1     1     1     90
2     1     2     78
3     1     3     92
4     1     4     69
5     3     1     82
6     3     2     85
7     3     3     91

Or using data.table

library(data.table)
data <- setDT(data)[data[, .I[uniqueN(year) >=3], by = ID]$V1]

-output

data
    ID year score1
1:  1    1     90
2:  1    2     78
3:  1    3     92
4:  1    4     69
5:  3    1     82
6:  3    2     85
7:  3    3     91

Upvotes: 2

Related Questions