Jonathan Dunne
Jonathan Dunne

Reputation: 189

How can I split large data.frame into smaller ones without using a loop?

I have a very large dataframe (20k rows) The dataframe basically contains a date / timestamp some text and a delta between the first timestamp and subsequent time stamps.

                date   text time.diff
1 2016-03-09 15:50:07 Text 1     0.000
2 2016-03-09 15:50:10 Text 2     2.808
3 2016-03-09 15:50:17 Text 3    10.128
4 2016-03-09 15:50:53 Text 4    45.952
5 2016-03-09 21:26:15 Text 5    65.053

I'd like to be able to split this dataframe into smaller chunks based on values contained in time.diff (say into chunks of 60 seconds). So for example, splitting into two using subset can be done like so, but if I have a much larger frame, I will end up writing 1000's lines of code!

I could also create a loop to iterate through a much larger dataframe and accomplish this task, but I know that using loops in R is rather slow.

So I'm wondering what approach I can take to split the larger frame into many smaller frames in a way that doesn't use a loop and can also increment smaller dataframe names e.g. df.sub.1, df.sub.2 ... df.sub.3

# Split into two frames based on matched criteria
df.split1 <- subset(df.tosplit, time.diff <= 60)
df.split2 <- subset(df.tosplit, time.diff > 60)

> df.split1
                 date   text time.diff
1 2016-03-09 15:50:07 Text 1     0.000
2 2016-03-09 15:50:10 Text 2     2.808
3 2016-03-09 15:50:17 Text 3    10.128
4 2016-03-09 15:50:53 Text 4    45.952
> df.split2
                 date   text time.diff
5 2016-03-09 21:26:15 Text 5    65.053
6 2016-03-09 21:26:20 Text 6    85.110

I've included some sample code to create the first six lines which hopefully should be enough for folks to suggest a way forward here.

# Create Data
date <- c("2016-03-09 15:50:07", "2016-03-09 15:50:10", "2016-03-09 15:50:17" ,
      "2016-03-09 15:50:53", "2016-03-09 21:26:15", "2016-03-09 21:26:20")
text <- c("Text 1", "Text 2", "Text 3", "Text 4", "Text 5", "Text 6")
time.diff <- c(0, 2.808, 10.128, 45.952, 65.053, 85.110)
df.tosplit <- data.frame(date, text, time.diff)

Upvotes: 1

Views: 841

Answers (2)

utubun
utubun

Reputation: 4520

Using split():

split(df, paste0("df.split", df$time.diff %/% 60))

$df.split0
                  dat   text time.diff
1 2016-03-09 15:50:07 Text 1     0.000
2 2016-03-09 15:50:10 Text 2     2.808
3 2016-03-09 15:50:17 Text 3    10.128
4 2016-03-09 15:50:53 Text 4    45.952

$df.split1
                  dat   text time.diff
5 2016-03-09 21:26:15 Text 5    65.053
6 2016-03-09 21:26:20 Text 6    85.110

Exotic way (see the explanation here):

list2env(split(df, paste0("df.split", df$time.diff %/% 60)), .GlobalEnv)

Upvotes: 3

Pierre L
Pierre L

Reputation: 28441

Instead of creating a loop for each case, creating a sequence with the 60 second condition as an index will enable split points.

indx <- cut(df.tosplit$time.diff, 
            c(seq(0, max(df.tosplit$time.diff), by=60),Inf), 
            right=FALSE)
split(df.tosplit, indx)
# $`[0,60)`
#                  date   text time.diff
# 1 2016-03-09 15:50:07 Text 1     0.000
# 2 2016-03-09 15:50:10 Text 2     2.808
# 3 2016-03-09 15:50:17 Text 3    10.128
# 4 2016-03-09 15:50:53 Text 4    45.952
# 
# $`[60,Inf)`
#                  date   text time.diff
# 5 2016-03-09 21:26:15 Text 5    65.053
# 6 2016-03-09 21:26:20 Text 6    85.110

Upvotes: 1

Related Questions