Reputation: 6726

R: Fast Subsetting Large Data Table Conditioned on Key Words in One of the Columns

The data table I am working with is like

require(data.table)
set.seed(2)
dt <- data.table(user=c(rep('a', 3), rep('b', 2), rep('c', 4)),
                 type=c(sample(LETTERS[1:4], 3), 
                        sample(LETTERS[1:4], 2),
                        sample(LETTERS[1:4], 4))
                 )

which is

   user type
1:    a    A
2:    a    C
3:    a    B
4:    b    A
5:    b    C
6:    c    D
7:    c    A
8:    c    B
9:    c    C

I'd like to find specific users of types in either A, B or C only. In the example above, user c is not qualified because he has D in one of his type records. So the desired output should be

   user type
1:    a    A
2:    a    C
3:    a    B
4:    b    A
5:    b    C

The first method, which is obvious inefficient, comes to my mind is split dt with split(dt, dt$user) and check if grepl and nrow are the same, followed by indexing and rbindlist. Since the data table I am actually working with has 10989251 rows, an efficient way to do the subsetting is necessary.

Upvotes: 1

Answers (3)

webb

Reputation: 4340

Some might find this way less elegant, but it might be faster: find users who have type D, then exclude them. Could skip the unique if there are no or few duplicate user-type pairs.

badusers = dt[type=='D',unique(user)];
dt.ABCs = dt[!user %in% badusers,];

Upvotes: 0

akrun

Reputation: 887691

Grouped by 'user', if all the elements in 'type' have only the first three LETTERS, we get the Subset of Data.table (.SD). Here I am using %chin% for vector comparison as it is a faster version of %in% optimized for character vectors.

dt[, if(all(type %chin% LETTERS[1:3])) .SD, by = user]
#    user type
#1:    a    A
#2:    a    C
#3:    a    B
#4:    b    A
#5:    b    C

Upvotes: 4

Colonel Beauvel

Reputation: 31181

With a classic filter selection:

dt[unlist(by(type, user, function(x) !!cumprod(x %in% LETTERS[1:3]))),]
#   user type
#1:    a    A
#2:    a    C
#3:    a    B
#4:    b    A
#5:    b    C

Upvotes: 1

R: Fast Subsetting Large Data Table Conditioned on Key Words in One of the Columns

Answers (3)

Related Questions