Reputation: 189
I have a very large dataframe (20k rows) The dataframe basically contains a date / timestamp some text and a delta between the first timestamp and subsequent time stamps.
date text time.diff
1 2016-03-09 15:50:07 Text 1 0.000
2 2016-03-09 15:50:10 Text 2 2.808
3 2016-03-09 15:50:17 Text 3 10.128
4 2016-03-09 15:50:53 Text 4 45.952
5 2016-03-09 21:26:15 Text 5 65.053
I'd like to be able to split this dataframe into smaller chunks based on values contained in time.diff (say into chunks of 60 seconds). So for example, splitting into two using subset can be done like so, but if I have a much larger frame, I will end up writing 1000's lines of code!
I could also create a loop to iterate through a much larger dataframe and accomplish this task, but I know that using loops in R is rather slow.
So I'm wondering what approach I can take to split the larger frame into many smaller frames in a way that doesn't use a loop and can also increment smaller dataframe names e.g. df.sub.1, df.sub.2 ... df.sub.3
# Split into two frames based on matched criteria
df.split1 <- subset(df.tosplit, time.diff <= 60)
df.split2 <- subset(df.tosplit, time.diff > 60)
> df.split1
date text time.diff
1 2016-03-09 15:50:07 Text 1 0.000
2 2016-03-09 15:50:10 Text 2 2.808
3 2016-03-09 15:50:17 Text 3 10.128
4 2016-03-09 15:50:53 Text 4 45.952
> df.split2
date text time.diff
5 2016-03-09 21:26:15 Text 5 65.053
6 2016-03-09 21:26:20 Text 6 85.110
I've included some sample code to create the first six lines which hopefully should be enough for folks to suggest a way forward here.
# Create Data
date <- c("2016-03-09 15:50:07", "2016-03-09 15:50:10", "2016-03-09 15:50:17" ,
"2016-03-09 15:50:53", "2016-03-09 21:26:15", "2016-03-09 21:26:20")
text <- c("Text 1", "Text 2", "Text 3", "Text 4", "Text 5", "Text 6")
time.diff <- c(0, 2.808, 10.128, 45.952, 65.053, 85.110)
df.tosplit <- data.frame(date, text, time.diff)
Upvotes: 1
Views: 841
Reputation: 4520
Using split()
:
split(df, paste0("df.split", df$time.diff %/% 60))
$df.split0
dat text time.diff
1 2016-03-09 15:50:07 Text 1 0.000
2 2016-03-09 15:50:10 Text 2 2.808
3 2016-03-09 15:50:17 Text 3 10.128
4 2016-03-09 15:50:53 Text 4 45.952
$df.split1
dat text time.diff
5 2016-03-09 21:26:15 Text 5 65.053
6 2016-03-09 21:26:20 Text 6 85.110
Exotic way (see the explanation here):
list2env(split(df, paste0("df.split", df$time.diff %/% 60)), .GlobalEnv)
Upvotes: 3
Reputation: 28441
Instead of creating a loop for each case, creating a sequence with the 60 second condition as an index will enable split points.
indx <- cut(df.tosplit$time.diff,
c(seq(0, max(df.tosplit$time.diff), by=60),Inf),
right=FALSE)
split(df.tosplit, indx)
# $`[0,60)`
# date text time.diff
# 1 2016-03-09 15:50:07 Text 1 0.000
# 2 2016-03-09 15:50:10 Text 2 2.808
# 3 2016-03-09 15:50:17 Text 3 10.128
# 4 2016-03-09 15:50:53 Text 4 45.952
#
# $`[60,Inf)`
# date text time.diff
# 5 2016-03-09 21:26:15 Text 5 65.053
# 6 2016-03-09 21:26:20 Text 6 85.110
Upvotes: 1