Reputation: 2341
I have a dataset as follows:
DT <- fread(
"ID country year Event_A Event_B
4 BEL 2002 0 1
5 BEL 2002 0 1
6 NLD 2002 1 1
7 NLD 2006 1 0
8 NLD 2006 1 1
9 GBR 2001 0 1
10 GBR 2001 0 0
11 GBR 2001 0 1
12 GBR 2007 1 1
13 GBR 2007 1 1",
header = TRUE)
I would like to keep only observations for which I have observations in two country-years. So, BEL
will drop out because it only has observations in 2002.
I would like to do something like DT[,if(unique(year)>1) .SD, by=country]
but that does not do anything. I also tried DT[unique(year)>1, .SD, by=country]
but this gives the error:
Error in `[.data.table`(DT, unique(year) > 1, .SD, by = country) :
i evaluates to a logical vector length 4 but there are 10 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
Desired output:
DT <- fread(
"ID country year Event_A Event_B
6 NLD 2002 1 1
7 NLD 2006 1 0
8 NLD 2006 1 1
9 GBR 2001 0 1
10 GBR 2001 0 0
11 GBR 2001 0 1
12 GBR 2007 1 1
13 GBR 2007 1 1",
header = TRUE)
Upvotes: 0
Views: 49
Reputation: 1502
If it's not necessary to do it in data.table, you can count the number of distinct years by country via base R:
country_count <- aggregate(year ~ country, DT, FUN = function(x) NROW(unique(x)))
DT[DT$country %in% country_count$country[country_count$year > 1],]
# output
ID country year Event_A Event_B
3 6 NLD 2002 1 1
4 7 NLD 2006 1 0
5 8 NLD 2006 1 1
6 9 GBR 2001 0 1
7 10 GBR 2001 0 0
8 11 GBR 2001 0 1
9 12 GBR 2007 1 1
10 13 GBR 2007 1 1
Upvotes: 1
Reputation: 389055
You can use uniqueN
to get count of unique values and select rows using .SD
.
library(data.table)
DT[, .SD[uniqueN(year) > 1], country]
# country ID year Event_A Event_B
#1: NLD 6 2002 1 1
#2: NLD 7 2006 1 0
#3: NLD 8 2006 1 1
#4: GBR 9 2001 0 1
#5: GBR 10 2001 0 0
#6: GBR 11 2001 0 1
#7: GBR 12 2007 1 1
#8: GBR 13 2007 1 1
Or in dplyr
we can do the same with n_distinct
and filter
library(dplyr)
DT %>% group_by(country) %>% filter(n_distinct(year) > 1)
Upvotes: 1
Reputation: 101848
Here is a base R solution by using ave()
and subset()
DTout <- subset(DT, as.logical(ave(DT$year,DT$country, FUN = function(x) length(unique(x))>=2)))
such that
> DTout
ID country year Event_A Event_B
3 6 NLD 2002 1 1
4 7 NLD 2006 1 0
5 8 NLD 2006 1 1
6 9 GBR 2001 0 1
7 10 GBR 2001 0 0
8 11 GBR 2001 0 1
9 12 GBR 2007 1 1
10 13 GBR 2007 1 1
Upvotes: 1
Reputation: 1488
In the same spirit as @user2474226
, if you're open to other packages, a simple dplyr
solution:
library(data.table)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
DT <- fread(
"ID country year Event_A Event_B
4 BEL 2002 0 1
5 BEL 2002 0 1
6 NLD 2002 1 1
7 NLD 2006 1 0
8 NLD 2006 1 1
9 GBR 2001 0 1
10 GBR 2001 0 0
11 GBR 2001 0 1
12 GBR 2007 1 1
13 GBR 2007 1 1",
header = TRUE)
# Detect count of countries
sel_cnt <-
DT %>%
count(country, year) %>%
count(country) %>%
filter(n > 1)
DT %>%
semi_join(sel_cnt, by = "country")
#> ID country year Event_A Event_B
#> 1 6 NLD 2002 1 1
#> 2 7 NLD 2006 1 0
#> 3 8 NLD 2006 1 1
#> 4 9 GBR 2001 0 1
#> 5 10 GBR 2001 0 0
#> 6 11 GBR 2001 0 1
#> 7 12 GBR 2007 1 1
#> 8 13 GBR 2007 1 1
Upvotes: 1