Reputation: 5660
I have the following data frame in R:
> head(df)
        date x y z n  t
1 2012-01-01 1 1 1 0 52
2 2012-01-01 1 1 2 0 52
3 2012-01-01 1 1 3 0 52
4 2012-01-01 1 1 4 0 52
5 2012-01-01 1 1 5 0 52
6 2012-01-01 1 1 6 0 52
> str(df)
'data.frame': 4617600 obs. of 6 variables:
$ date: Date, format: "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" ...
$ x : Factor w/ 45 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ y : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ z : Factor w/ 111 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ n : int 0 0 0 0 0 0 0 0 29 0 ...
$ t : num 52 52 52 52 52 52 52 52 52 52 ...
What I want to do is split this large df into smaller data frames as follows: 1) one data frame per level of the factor 'x' (45 in total); 2) each of those 45 data frames split again by level of the factor 'z' (111 levels). That gives a total of 45*111 = 4995 data frames.
I've seen plenty online about splitting data frames, which turns them into lists. However, I don't see how to split those lists further. I'm also concerned about memory: if I split the data frame into a list, won't the list take up at least as much memory as the original? Running prediction models on top of that seems infeasible. Ideally I would split off one data frame at a time, run the prediction models on it, keep the results I need, and then delete it before moving on to the next one.
Upvotes: 0
Views: 87
Reputation: 66819
Here's what I would do. Your data already fits in memory, so just leave it in one piece:
library(data.table)
setDT(df)  # convert df to a data.table by reference, without copying it
df[, {
  sum(t * n)  # or whatever you're doing for "prediction models"
}, by = list(x, z)]
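To make the "whatever you're doing" part more concrete, here is a minimal sketch of fitting one model per (x, z) group without ever materializing the 4995 pieces. The toy table and the lm(n ~ t) model are assumptions for illustration only; swap in your real df and whatever model you actually need.

```r
library(data.table)

# Hypothetical toy data using the question's column names (not the real df)
set.seed(1)
toy <- data.table(
  x = factor(rep(1:3, each = 40)),
  z = factor(rep(rep(1:2, each = 20), times = 3)),
  t = rep(1:20, times = 6),
  n = rpois(120, lambda = 5)
)

# One model per (x, z) group; .SD holds only the current group's rows,
# so only one group's subset is being worked on at any time
fits <- toy[, {
  m <- lm(n ~ t, data = .SD)  # placeholder model, an assumption
  list(intercept = coef(m)[[1]], slope = coef(m)[[2]])
}, by = .(x, z)]

print(fits)
```

The result has one row per (x, z) combination, which is usually all you need to keep. If you really do want a list of separate tables, split(toy, by = c("x", "z")) from data.table should give you that, but it copies the data into the pieces, so memory use goes up rather than down.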
Upvotes: 1