Shruthi Patil
Shruthi Patil

Reputation: 87

Usage of 'for loop' in R to split a dataframe into several dataframes

I have a problem with for loop. I have a dataframe with 120 unique IDs. I want to split the dataframe into 120 different dataframes based on the ID. I split it using the following code:

split_part0 <- split(PART0_DF, PART0_DF$sysid)

Now I want to do something like

for(i in 1:120){ 
sys[i] <- as.data.frame(split_part0[[i]])}

This way I have the 120 dataframes with unique frame names I can use for further analysis. Is using 'for loop' in this particular case not possible? If so, what other commands can I use? Dummy data for PART0_DF:

 Date      sysid   power   temperature
 1.1.2018    1     1000       14
 2.1.2018    1     1200       16
 3.1.2018    1      800       18
 1.1.2018    2     1500        8
 2.1.2018    2      800       18
 3.1.2018    2     1300       11

I want the output to be like

     >>sys1
     Date      sysid   power   temperature
     1.1.2018    1     1000     14
     2.1.2018    1     1200     16
     3.1.2018    1      800     18
     >>sys2
     1.1.2018    2     1500      8
     2.1.2018    2      800     18
     3.1.2018    2     1300     11

Upvotes: 1

Views: 5322

Answers (2)

Len Greski
Len Greski

Reputation: 10855

An easy way to do this is to create a factor vector by appending the string sys to the id numbers, and using it to split the data. There is no need to use a for() loop to produce the desired output, since the result of split() is a list of data frames when the input to be split is a data frame.

The value of the factor is used to name each element in the list generated by split(). In the case of the OP, since sysid is numeric and starts with 1, it's not obvious that the id numbers are being used to name the resulting data frames in the list, as explained in the help for split().

Using the data from the OP we'll illustrate how to use the sysid column to create a factor variable that combines the string sys with the id values, and split it into a list of data frames that can be accessed by name.

rawData <- "Date      sysid   power   temperature
 1.1.2018    1     1000       14
 2.1.2018    1     1200       16
 3.1.2018    1      800       18
 1.1.2018    2     1500        8
 2.1.2018    2      800       18
 3.1.2018    2     1300       11"

data <- read.table(text = rawData,header=TRUE)
sysidName <- paste0("sys",data$sysid)

splitData <- split(data,sysidName)

splitData

...and the output:

> splitData
$`sys1`
      Date sysid power temperature
1 1.1.2018     1  1000          14
2 2.1.2018     1  1200          16
3 3.1.2018     1   800          18

$sys2
      Date sysid power temperature
4 1.1.2018     2  1500           8
5 2.1.2018     2   800          18
6 3.1.2018     2  1300          11

>

At this point one can access individual data frames in the list by using the $ form of the extract operator:

> splitData$sys1
      Date sysid power temperature sysidName
1 1.1.2018     1  1000          14      sys1
2 2.1.2018     1  1200          16      sys1
3 3.1.2018     1   800          18      sys1
>

Also, by using the names() function one can obtain a vector of all the named elements in the list of data frames.

> names(splitData)
[1] "sys1" "sys2"
> 

Reiterating the main point from the top of the answer, when split() is used with a data frame, the resulting list is a list of objects of type data.frame(). For example:

> str(splitData["sys1"])
List of 1
 $ sys1:'data.frame':   3 obs. of  4 variables:
  ..$ Date       : Factor w/ 3 levels "1.1.2018","2.1.2018",..: 1 2 3
  ..$ sysid      : int [1:3] 1 1 1
  ..$ power      : int [1:3] 1000 1200 800
  ..$ temperature: int [1:3] 14 16 18
>

If you must use a for() loop...

Since the OP asked whether the problem could be solved with a for() loop, the answer is "yes."

# create a vector containing unique values of sysid
ids <- unique(data$sysid)
# initialize output data frame list 
dfList <- list() 
# loop thru unique values and generate named data frames in list() 
for(i in ids){
     dfname <- paste0("sys",i)
     dfList[[dfname]] <- data[data$sysid == i,]
}
dfList 

...and the output:

> for(i in ids){
+      dfname <- paste0("sys",i)
+      dfList[[dfname]] <- data[data$sysid == i,]
+ }
> dfList
$`sys1`
      Date sysid power temperature
1 1.1.2018     1  1000          14
2 2.1.2018     1  1200          16
3 3.1.2018     1   800          18

$sys2
      Date sysid power temperature
4 1.1.2018     2  1500           8
5 2.1.2018     2   800          18
6 3.1.2018     2  1300          11

Choosing the "best" answer

Between split(), for() and the other answer using by(), how do we choose the best answer?

One way is to determine which version runs fastest, given that the real data will be much larger than the sample data from the original post.

We can use the microbenchmark package to compare the performance of the three different approaches.

split() performance

library(microbenchmark)
> microbenchmark(splitData <- split(data,sysidName),unit="us")
Unit: microseconds
                                expr     min      lq     mean   median       uq     max neval
 splitData <- split(data, sysidName) 144.594 147.359 185.7987 150.1245 170.4705 615.507   100
> 

for() performance

> microbenchmark(for(i in ids){
+      dfname <- paste0("sys",i)
+      dfList[[dfname]] <- data[data$sysid == i,]
+ },unit="us")
Unit: microseconds
                                                                                              expr      min       lq     mean
 for (i in ids) {     dfname <- paste0("sys", i)     dfList[[dfname]] <- data[data$sysid == i, ] } 2643.755 2857.286 3457.642
   median       uq      max neval
 3099.064 3479.311 8511.609   100
>

by() performance

> microbenchmark(df_list <- by(df, df$sysid, function(unique) unique),unit="us")
Unit: microseconds
                                                 expr     min       lq     mean   median       uq      max neval
 df_list <- by(df, df$sysid, function(unique) unique) 256.791 260.5445 304.9296 275.9515 309.5325 1218.372   100
>

...and the winner is:

split(), with an average runtime of 186 microseconds, versus 305 microseconds for by() and a whopping 3,458 microseconds for the for() loop approach.

Upvotes: 2

Chris Ruehlemann
Chris Ruehlemann

Reputation: 21400

Another option is using the function by():

df <- data.frame(
  Date = c("1.1.2018",  "2.1.2018", "3.1.2018", "1.1.2018", "2.1.2018", "3.1.2018"),
  sysid = c(1, 1, 1, 2, 2, 2),
  power = c(1000, 1200, 800, 1500, 800, 1300)
  )
df
  Date sysid power
1 1.1.2018     1  1000
2 2.1.2018     1  1200
3 3.1.2018     1   800
4 1.1.2018     2  1500
5 2.1.2018     2   800
6 3.1.2018     2  1300

Now split df in as many dataframes as you have distinct ('unique') sysid values using by() and calling unique:

df_list <- by(df, df$sysid, function(unique) unique)
df_list
df$sysid: 1
      Date sysid power
1 1.1.2018     1  1000
2 2.1.2018     1  1200
3 3.1.2018     1   800
---------------------------------------------------------------------------------------------- 
df$sysid: 2
      Date sysid power
4 1.1.2018     2  1500
5 2.1.2018     2   800
6 3.1.2018     2  1300

Upvotes: 1

Related Questions