chichi
chichi

Reputation: 41

Split a large dataframe into multiple dataframes by row in R

I have a large dataframe with over 1m records by over 40 variables that I want to run through a loop for values update to prevent glitch, and the size of the file varies each time. I would like split the over 1m rows into n smaller sets, each with a new dataframe name such as ending with 1,2,...,n, newdf1, newdf2, newdf3,... . The split function in R only split the vector but it is still in that dataframe, not creating multiple subset of the dataframe which I would need to retain the data type in the 40 variables. I don't have any values to split by column, I just want to split by the row number in order.

Upvotes: 2

Views: 6934

Answers (2)

Len Greski
Len Greski

Reputation: 10845

Here is an example where we generate a data frame with 1 million rows, split it into 20 groups, name the data frames in the resulting list, and run summary() on the first data frame in the list by extracting it with the $ operator by name.

set.seed(90125)
df <- as.data.frame(matrix(rnorm(40000000,mean = 25,sd=5),ncol=40))
# make group number start with 1, as remainders
# vary from 0 - 19
df$group <- 1:1000000 %% 20 + 1
dfList <- split(df,df$group)

At this point there are two copies of the data in memory: the original data frame, df, and a second copy that is split into 20 different data frames within the list dfList.

We can demonstrate this with the length() function:

> length(dfList)
[1] 20

Each data frame has 50,000 observations, which we can demonstrate with the nrow() function:

> nrow(dfList[[1]])
[1] 50000

We can assign names to the data frames with a combination of names() and paste0().

# add names to the list and print the names
names(dfList) <- paste0("data",1:20)
names(dfList)
> names(dfList)
 [1] "data1"  "data2"  "data3"  "data4"  "data5"  "data6"  "data7"  "data8" 
 [9] "data9"  "data10" "data11" "data12" "data13" "data14" "data15" "data16"
[17] "data17" "data18" "data19" "data20"

Once the data frames have been named, they can be accessed from the list via the $ form of the extract operator. We'll generate a summary of the first data frame, accessing it by name.

summary(dfList$data1[1:5])

...and the output:

> summary(dfList$data1[1:5])
       V1                V2               V3              V4        
 Min.   :-0.7251   Min.   : 2.481   Min.   : 3.02   Min.   : 3.173  
 1st Qu.:21.5919   1st Qu.:21.603   1st Qu.:21.57   1st Qu.:21.526  
 Median :24.9990   Median :24.982   Median :24.97   Median :24.914  
 Mean   :24.9968   Mean   :24.978   Mean   :24.97   Mean   :24.931  
 3rd Qu.:28.3971   3rd Qu.:28.330   3rd Qu.:28.32   3rd Qu.:28.357  
 Max.   :45.7101   Max.   :44.730   Max.   :48.03   Max.   :45.506  
       V5        
 Min.   : 2.427  
 1st Qu.:21.595  
 Median :25.010  
 Mean   :24.997  
 3rd Qu.:28.406  
 Max.   :44.199  
> 

NOTE: the original data frame consumes about 312.8Mb, so after execution of the split() function, the two copies of data will consume about 625.6Mb of RAM. One can use the rm() function to remove the original data frame from RAM if necessary.

Upvotes: 4

aromatic6tet
aromatic6tet

Reputation: 111

Without seeing a minimally reproducible example with some sample data, I think that I have an idea of what you're trying to accomplish. Lets say that you have a data set of 10000 rows, and want to split your data into separate data frames of 100 rows each. You could subset the larger data frame into a list of smaller data frames based on the length that you want those smaller data frames to be.


####  Some dummy data here ####
colA <- rep(c("X1", "X2", "X3", "X4", "X5"), 2000)
colB <- rnorm(n = 10000, mean = 3, sd = 0.25)
colC <- rnorm(n = 10000, mean = 1, sd = 1)

df <- as.data.frame(cbind(colA, colB, colC), stringsAsFactors = FALSE)

#### Create an empty list to be filled with your subset data frames ####
dataList <- list()

#### Since it's 10000 rows total, and you want 100 smaller data frames, ####
#### create an iterator variable that you can reference for subsetting. ####
listIterator <- seq(0,99,1)

for(i in listIterator){
  dataList[[i+1]] <- df[c((1+(100*i)):(100+(100*i))),]  
}

dataList[1]
dataList[2]

You could then subset the list of data frames into whichever data frame you're going to work with at a particular time . . .

df1 <- dataList[1]

Or reference particular columns within each

dataList[[1]]$colA

Upvotes: 1

Related Questions