Tom
Tom

Reputation: 2341

Selecting observations for which two years are available by country

I have a dataset as follows:

DT <- fread(
"ID country year Event_A Event_B
4   BEL   2002  0   1
5   BEL   2002  0   1
6   NLD   2002  1   1
7   NLD   2006  1   0
8   NLD   2006  1   1
9   GBR   2001  0   1
10  GBR   2001  0   0
11  GBR   2001  0   1
12  GBR   2007  1   1
13  GBR   2007  1   1",
header = TRUE)

I would like to keep only observations for which I have observations in two country-years. So, BEL will drop out because it only has observations in 2002.

I would like to do something like DT[,if(unique(year)>1) .SD, by=country] but that does not do anything. I also tried DT[unique(year)>1, .SD, by=country] but this gives the error:

Error in `[.data.table`(DT, unique(year) > 1, .SD, by = country) : 
  i evaluates to a logical vector length 4 but there are 10 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.

Desired output:

DT <- fread(
"ID country year Event_A Event_B
6   NLD   2002  1   1
7   NLD   2006  1   0
8   NLD   2006  1   1
9   GBR   2001  0   1
10  GBR   2001  0   0
11  GBR   2001  0   1
12  GBR   2007  1   1
13  GBR   2007  1   1",
header = TRUE)

Upvotes: 0

Views: 49

Answers (4)

user2474226
user2474226

Reputation: 1502

If it's not necessary to do it in data.table, you can count the number of distinct years by country via base R:

country_count <- aggregate(year ~ country, DT, FUN = function(x) NROW(unique(x)))
DT[DT$country %in% country_count$country[country_count$year > 1],]
# output
   ID country year Event_A Event_B
3   6     NLD 2002       1       1
4   7     NLD 2006       1       0
5   8     NLD 2006       1       1
6   9     GBR 2001       0       1
7  10     GBR 2001       0       0
8  11     GBR 2001       0       1
9  12     GBR 2007       1       1
10 13     GBR 2007       1       1

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389055

You can use uniqueN to get count of unique values and select rows using .SD.

library(data.table)
DT[, .SD[uniqueN(year) > 1], country]

#   country ID year Event_A Event_B
#1:     NLD  6 2002       1       1
#2:     NLD  7 2006       1       0
#3:     NLD  8 2006       1       1
#4:     GBR  9 2001       0       1
#5:     GBR 10 2001       0       0
#6:     GBR 11 2001       0       1
#7:     GBR 12 2007       1       1
#8:     GBR 13 2007       1       1

Or in dplyr we can do the same with n_distinct and filter

library(dplyr)
DT %>% group_by(country) %>% filter(n_distinct(year) > 1)

Upvotes: 1

ThomasIsCoding
ThomasIsCoding

Reputation: 101848

Here is a base R solution by using ave() and subset()

DTout <- subset(DT, as.logical(ave(DT$year,DT$country, FUN = function(x) length(unique(x))>=2)))

such that

> DTout
   ID country year Event_A Event_B
3   6     NLD 2002       1       1
4   7     NLD 2006       1       0
5   8     NLD 2006       1       1
6   9     GBR 2001       0       1
7  10     GBR 2001       0       0
8  11     GBR 2001       0       1
9  12     GBR 2007       1       1
10 13     GBR 2007       1       1

Upvotes: 1

cimentadaj
cimentadaj

Reputation: 1488

In the same spirit as @user2474226, if you're open to other packages, a simple dplyrsolution:

 library(data.table)
 library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

  DT <- fread(
    "ID country year Event_A Event_B
4   BEL   2002  0   1
5   BEL   2002  0   1
6   NLD   2002  1   1
7   NLD   2006  1   0
8   NLD   2006  1   1
9   GBR   2001  0   1
10  GBR   2001  0   0
11  GBR   2001  0   1
12  GBR   2007  1   1
13  GBR   2007  1   1",
header = TRUE)

  # Detect count of countries
  sel_cnt <-
    DT %>%
    count(country, year) %>%
    count(country) %>%
    filter(n > 1)


  DT %>%
    semi_join(sel_cnt, by = "country")
#>   ID country year Event_A Event_B
#> 1  6     NLD 2002       1       1
#> 2  7     NLD 2006       1       0
#> 3  8     NLD 2006       1       1
#> 4  9     GBR 2001       0       1
#> 5 10     GBR 2001       0       0
#> 6 11     GBR 2001       0       1
#> 7 12     GBR 2007       1       1
#> 8 13     GBR 2007       1       1

Upvotes: 1

Related Questions