Reputation: 41
I have a large dataframe with over 1m records by over 40 variables that I want to run through a loop for values update to prevent glitch, and the size of the file varies each time. I would like split the over 1m rows into n smaller sets, each with a new dataframe name such as ending with 1,2,...,n, newdf1, newdf2, newdf3,... . The split function in R only split the vector but it is still in that dataframe, not creating multiple subset of the dataframe which I would need to retain the data type in the 40 variables. I don't have any values to split by column, I just want to split by the row number in order.
Upvotes: 2
Views: 6934
Reputation: 10845
Here is an example where we generate a data frame with 1 million rows, split it into 20 groups, name the data frames in the resulting list, and run summary()
on the first data frame in the list by extracting it with the $
operator by name.
set.seed(90125)
df <- as.data.frame(matrix(rnorm(40000000,mean = 25,sd=5),ncol=40))
# make group number start with 1, as remainders
# vary from 0 - 19
df$group <- 1:1000000 %% 20 + 1
dfList <- split(df,df$group)
At this point there are two copies of the data in memory: the original data frame, df
, and a second copy that is split into 20 different data frames within the list dfList
.
We can demonstrate this with the length()
function:
> length(dfList)
[1] 20
Each data frame has 50,000 observations, which we can demonstrate with the nrow()
function:
> nrow(dfList[[1]])
[1] 50000
We can assign names to the data frames with a combination of names()
and paste0()
.
# add names to the list and print the names
names(dfList) <- paste0("data",1:20)
names(dfList)
> names(dfList)
[1] "data1" "data2" "data3" "data4" "data5" "data6" "data7" "data8"
[9] "data9" "data10" "data11" "data12" "data13" "data14" "data15" "data16"
[17] "data17" "data18" "data19" "data20"
Once the data frames have been named, they can be accessed from the list via the $
form of the extract operator. We'll generate a summary of the first data frame, accessing it by name.
summary(dfList$data1[1:5])
...and the output:
> summary(dfList$data1[1:5])
V1 V2 V3 V4
Min. :-0.7251 Min. : 2.481 Min. : 3.02 Min. : 3.173
1st Qu.:21.5919 1st Qu.:21.603 1st Qu.:21.57 1st Qu.:21.526
Median :24.9990 Median :24.982 Median :24.97 Median :24.914
Mean :24.9968 Mean :24.978 Mean :24.97 Mean :24.931
3rd Qu.:28.3971 3rd Qu.:28.330 3rd Qu.:28.32 3rd Qu.:28.357
Max. :45.7101 Max. :44.730 Max. :48.03 Max. :45.506
V5
Min. : 2.427
1st Qu.:21.595
Median :25.010
Mean :24.997
3rd Qu.:28.406
Max. :44.199
>
NOTE: the original data frame consumes about 312.8Mb, so after execution of the split()
function, the two copies of data will consume about 625.6Mb of RAM. One can use the rm()
function to remove the original data frame from RAM if necessary.
Upvotes: 4
Reputation: 111
Without seeing a minimally reproducible example with some sample data, I think that I have an idea of what you're trying to accomplish. Lets say that you have a data set of 10000 rows, and want to split your data into separate data frames of 100 rows each. You could subset the larger data frame into a list of smaller data frames based on the length that you want those smaller data frames to be.
#### Some dummy data here ####
colA <- rep(c("X1", "X2", "X3", "X4", "X5"), 2000)
colB <- rnorm(n = 10000, mean = 3, sd = 0.25)
colC <- rnorm(n = 10000, mean = 1, sd = 1)
df <- as.data.frame(cbind(colA, colB, colC), stringsAsFactors = FALSE)
#### Create an empty list to be filled with your subset data frames ####
dataList <- list()
#### Since it's 10000 rows total, and you want 100 smaller data frames, ####
#### create an iterator variable that you can reference for subsetting. ####
listIterator <- seq(0,99,1)
for(i in listIterator){
dataList[[i+1]] <- df[c((1+(100*i)):(100+(100*i))),]
}
dataList[1]
dataList[2]
You could then subset the list of data frames into whichever data frame you're going to work with at a particular time . . .
df1 <- dataList[1]
Or reference particular columns within each
dataList[[1]]$colA
Upvotes: 1