Reputation: 9018
I am working with a data frame of 3 million rows and 10 columns, and subsetting it takes a long time. Would it be faster to convert it to a data.table and subset that instead? Here is some toy code:
s<-c(100,100,100,800,800,6662,33565,265653262,266532)
p<-c(5,5,5,10,10,10,8,9,10)
name<-c("bob","bob","bob","ed","ed","ed","joe","frank","ted")
time<- as.POSIXct(as.character(c("2014-10-27 18:11:36 PDT","2014-10-27 18:11:37 PDT","2014-10-27 18:11:38 PDT","2014-10-27 18:11:39 PDT","2014-10-27 18:11:40 PDT","2014-10-27 18:11:41 PDT","2014-10-27 19:11:36 PDT","2014-10-27 20:11:36 PDT","2014-10-27 21:11:36 PDT")))
dat<- data.frame(s,p,name,time)
dat
Here is the data frame:
          s  p  name                time
1       100  5   bob 2014-10-27 18:11:36
2       100  5   bob 2014-10-27 18:11:37
3       100  5   bob 2014-10-27 18:11:38
4       800 10    ed 2014-10-27 18:11:39
5       800 10    ed 2014-10-27 18:11:40
6      6662 10    ed 2014-10-27 18:11:41
7     33565  8   joe 2014-10-27 19:11:36
8 265653262  9 frank 2014-10-27 20:11:36
9    266532 10   ted 2014-10-27 21:11:36
Now I subset the data frame:
result <- subset(dat, as.numeric(s) == 100
& p == 5
& name == "bob"
& time >= "2014-10-27 18:11:36 PDT"
& time <= "2014-10-27 18:12:00 PDT"
)
result
    s p name                time
1 100 5  bob 2014-10-27 18:11:36
2 100 5  bob 2014-10-27 18:11:37
3 100 5  bob 2014-10-27 18:11:38
How can I do something similar using data.table?
Thank you.
Upvotes: 2
Views: 94
Reputation: 3304
Well, your example code is fragile because of the "time" selectors: you're comparing POSIXct dates (in the data frame) with character strings (in the selector), which relies on implicit coercion. It is safer to convert the strings to date-times explicitly. I think you want:
result <- subset(dat, as.numeric(s) == 100
& p == 5
& name == "bob"
& time >= as.POSIXlt("2014-10-27 18:11:36 PDT")
& time <= as.POSIXlt("2014-10-27 18:12:00 PDT")
)
result
    s p name                time
1 100 5  bob 2014-10-27 18:11:36
2 100 5  bob 2014-10-27 18:11:37
3 100 5  bob 2014-10-27 18:11:38
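If you are unsure which date-time class the column holds, a quick check like the one below can help. This is only a sketch on two of the toy timestamps; the names `times` and `bound` are made up for illustration and are not part of the original code.

```r
# Two of the toy timestamps, parsed in the session's local time zone
times <- as.POSIXct(c("2014-10-27 18:11:36", "2014-10-27 18:11:37"))

# Inspect the class before deciding how to build the comparison value
class(times)   # "POSIXct" "POSIXt"

# Build the bound explicitly as a date-time rather than a bare string
bound <- as.POSIXct("2014-10-27 18:11:36")
times >= bound # TRUE TRUE
```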
This syntax works perfectly well for data.tables:
library(data.table)
dat <- as.data.table(dat)
result <- subset(dat,
as.numeric(s) == 100
& p == 5
& name == "bob"
& time >= as.POSIXlt("2014-10-27 18:11:36 PDT")
& time <= as.POSIXlt("2014-10-27 18:12:00 PDT")
)
result
     s p name                time
1: 100 5  bob 2014-10-27 18:11:36
2: 100 5  bob 2014-10-27 18:11:37
3: 100 5  bob 2014-10-27 18:11:38
If you want something more data.table-like, you can avoid "subset" entirely and instead just operate on the data.table directly:
dat <- as.data.table(dat)
result <- dat[as.numeric(s) == 100
& p == 5
& name == "bob"
& time >= as.POSIXlt("2014-10-27 18:11:36 PDT")
& time <= as.POSIXlt("2014-10-27 18:12:00 PDT"),]
result
     s p name                time
1: 100 5  bob 2014-10-27 18:11:36
2: 100 5  bob 2014-10-27 18:11:37
3: 100 5  bob 2014-10-27 18:11:38
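On the speed question: one thing that may help on a large table is setting a key on the columns you match by equality, so that data.table can look rows up by binary search instead of scanning every row. The sketch below assumes the toy data from the question, rebuilt directly as a data.table called `dt` (a name chosen here for illustration), with `name` stored as character and the timestamps parsed in the local time zone. Whether this is actually faster on your 3 million rows is something to measure, for example with `system.time()`.

```r
library(data.table)

# Toy data from the question, built directly as a data.table
dt <- data.table(
  s    = c(100, 100, 100, 800, 800, 6662, 33565, 265653262, 266532),
  p    = c(5, 5, 5, 10, 10, 10, 8, 9, 10),
  name = c("bob", "bob", "bob", "ed", "ed", "ed", "joe", "frank", "ted"),
  time = as.POSIXct(c("2014-10-27 18:11:36", "2014-10-27 18:11:37",
                      "2014-10-27 18:11:38", "2014-10-27 18:11:39",
                      "2014-10-27 18:11:40", "2014-10-27 18:11:41",
                      "2014-10-27 19:11:36", "2014-10-27 20:11:36",
                      "2014-10-27 21:11:36"))
)

# Sort the table once by the equality columns; this sets the key
setkey(dt, name, p, s)

# Explicit date-time bounds for the range condition
lo <- as.POSIXct("2014-10-27 18:11:36")
hi <- as.POSIXct("2014-10-27 18:12:00")

# Keyed lookup on (name, p, s), then a range filter on the matched rows
result <- dt[.("bob", 5, 100)][time >= lo & time <= hi]
result
```

The key is paid for once (the sort in `setkey`), so this pattern tends to be most worthwhile when you run many lookups against the same table.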
Upvotes: 3