Reputation: 6726
The data table I am working with is like
require(data.table)
set.seed(2)
dt <- data.table(user=c(rep('a', 3), rep('b', 2), rep('c', 4)),
type=c(sample(LETTERS[1:4], 3),
sample(LETTERS[1:4], 2),
sample(LETTERS[1:4], 4))
)
which is
user type
1: a A
2: a C
3: a B
4: b A
5: b C
6: c D
7: c A
8: c B
9: c C
I'd like to find specific users of types in either A
, B
or C
only. In the example above, user c
is not qualified because he has D
in one of his type records. So the desired output should be
user type
1: a A
2: a C
3: a B
4: b A
5: b C
The first method, which is obvious inefficient, comes to my mind is split dt
with split(dt, dt$user)
and check if grepl
and nrow
are the same, followed by indexing and rbindlist
. Since the data table I am actually working with has 10989251 rows, an efficient way to do the subsetting is necessary.
Upvotes: 1
Views: 203
Reputation: 4340
Some might find this way less elegant, but it might be faster: find users who have type D, then exclude them. Could skip the unique if there are no or few duplicate user-type pairs.
badusers = dt[type=='D',unique(user)];
dt.ABCs = dt[!user %in% badusers,];
Upvotes: 0
Reputation: 887691
Grouped by 'user', if
all
the elements in 'type' have only the first three LETTERS
, we get the Subset of Data.table (.SD
). Here I am using %chin%
for vector comparison as it is a faster version of %in%
optimized for character
vectors.
dt[, if(all(type %chin% LETTERS[1:3])) .SD, by = user]
# user type
#1: a A
#2: a C
#3: a B
#4: b A
#5: b C
Upvotes: 4
Reputation: 31181
With a classic filter selection:
dt[unlist(by(type, user, function(x) !!cumprod(x %in% LETTERS[1:3]))),]
# user type
#1: a A
#2: a C
#3: a B
#4: b A
#5: b C
Upvotes: 1