Reputation: 87
I have a problem with for loop. I have a dataframe with 120 unique IDs. I want to split the dataframe into 120 different dataframes based on the ID. I split it using the following code:
split_part0 <- split(PART0_DF, PART0_DF$sysid)
Now I want to do something like
for(i in 1:120){
sys[i] <- as.data.frame(split_part0[[i]])}
This way I have the 120 dataframes with unique frame names I can use for further analysis.
Is using 'for loop' in this particular case not possible? If so, what other commands can I use?
Dummy data for PART0_DF
:
Date sysid power temperature
1.1.2018 1 1000 14
2.1.2018 1 1200 16
3.1.2018 1 800 18
1.1.2018 2 1500 8
2.1.2018 2 800 18
3.1.2018 2 1300 11
I want the output to be like
>>sys1
Date sysid power temperature
1.1.2018 1 1000 14
2.1.2018 1 1200 16
3.1.2018 1 800 18
>>sys2
1.1.2018 2 1500 8
2.1.2018 2 800 18
3.1.2018 2 1300 11
Upvotes: 1
Views: 5322
Reputation: 10855
An easy way to do this is to create a factor vector by appending the string sys
to the id numbers, and using it to split the data. There is no need to use a for()
loop to produce the desired output, since the result of split()
is a list of data frames when the input to be split is a data frame.
The value of the factor is used to name each element in the list generated by split()
. In the case of the OP, since sysid
is numeric and starts with 1, it's not obvious that the id numbers are being used to name the resulting data frames in the list, as explained in the help for split()
.
Using the data from the OP we'll illustrate how to use the sysid
column to create a factor variable that combines the string sys
with the id values, and split it into a list of data frames that can be accessed by name.
rawData <- "Date sysid power temperature
1.1.2018 1 1000 14
2.1.2018 1 1200 16
3.1.2018 1 800 18
1.1.2018 2 1500 8
2.1.2018 2 800 18
3.1.2018 2 1300 11"
data <- read.table(text = rawData,header=TRUE)
sysidName <- paste0("sys",data$sysid)
splitData <- split(data,sysidName)
splitData
...and the output:
> splitData
$`sys1`
Date sysid power temperature
1 1.1.2018 1 1000 14
2 2.1.2018 1 1200 16
3 3.1.2018 1 800 18
$sys2
Date sysid power temperature
4 1.1.2018 2 1500 8
5 2.1.2018 2 800 18
6 3.1.2018 2 1300 11
>
At this point one can access individual data frames in the list by using the $
form of the extract operator:
> splitData$sys1
Date sysid power temperature sysidName
1 1.1.2018 1 1000 14 sys1
2 2.1.2018 1 1200 16 sys1
3 3.1.2018 1 800 18 sys1
>
Also, by using the names()
function one can obtain a vector of all the named elements in the list of data frames.
> names(splitData)
[1] "sys1" "sys2"
>
Reiterating the main point from the top of the answer, when split()
is used with a data frame, the resulting list is a list of objects of type data.frame()
. For example:
> str(splitData["sys1"])
List of 1
$ sys1:'data.frame': 3 obs. of 4 variables:
..$ Date : Factor w/ 3 levels "1.1.2018","2.1.2018",..: 1 2 3
..$ sysid : int [1:3] 1 1 1
..$ power : int [1:3] 1000 1200 800
..$ temperature: int [1:3] 14 16 18
>
for()
loop...Since the OP asked whether the problem could be solved with a for()
loop, the answer is "yes."
# create a vector containing unique values of sysid
ids <- unique(data$sysid)
# initialize output data frame list
dfList <- list()
# loop thru unique values and generate named data frames in list()
for(i in ids){
dfname <- paste0("sys",i)
dfList[[dfname]] <- data[data$sysid == i,]
}
dfList
...and the output:
> for(i in ids){
+ dfname <- paste0("sys",i)
+ dfList[[dfname]] <- data[data$sysid == i,]
+ }
> dfList
$`sys1`
Date sysid power temperature
1 1.1.2018 1 1000 14
2 2.1.2018 1 1200 16
3 3.1.2018 1 800 18
$sys2
Date sysid power temperature
4 1.1.2018 2 1500 8
5 2.1.2018 2 800 18
6 3.1.2018 2 1300 11
Between split()
, for()
and the other answer using by()
, how do we choose the best answer?
One way is to determine which version runs fastest, given that the real data will be much larger than the sample data from the original post.
We can use the microbenchmark
package to compare the performance of the three different approaches.
split()
performancelibrary(microbenchmark)
> microbenchmark(splitData <- split(data,sysidName),unit="us")
Unit: microseconds
expr min lq mean median uq max neval
splitData <- split(data, sysidName) 144.594 147.359 185.7987 150.1245 170.4705 615.507 100
>
for()
performance> microbenchmark(for(i in ids){
+ dfname <- paste0("sys",i)
+ dfList[[dfname]] <- data[data$sysid == i,]
+ },unit="us")
Unit: microseconds
expr min lq mean
for (i in ids) { dfname <- paste0("sys", i) dfList[[dfname]] <- data[data$sysid == i, ] } 2643.755 2857.286 3457.642
median uq max neval
3099.064 3479.311 8511.609 100
>
by()
performance> microbenchmark(df_list <- by(df, df$sysid, function(unique) unique),unit="us")
Unit: microseconds
expr min lq mean median uq max neval
df_list <- by(df, df$sysid, function(unique) unique) 256.791 260.5445 304.9296 275.9515 309.5325 1218.372 100
>
split()
, with an average runtime of 186 microseconds, versus 305 microseconds for by()
and a whopping 3,458 microseconds for the for()
loop approach.
Upvotes: 2
Reputation: 21400
Another option is using the function by()
:
df <- data.frame(
Date = c("1.1.2018", "2.1.2018", "3.1.2018", "1.1.2018", "2.1.2018", "3.1.2018"),
sysid = c(1, 1, 1, 2, 2, 2),
power = c(1000, 1200, 800, 1500, 800, 1300)
)
df
Date sysid power
1 1.1.2018 1 1000
2 2.1.2018 1 1200
3 3.1.2018 1 800
4 1.1.2018 2 1500
5 2.1.2018 2 800
6 3.1.2018 2 1300
Now split df
in as many dataframes as you have distinct ('unique') sysid
values using by()
and calling unique
:
df_list <- by(df, df$sysid, function(unique) unique)
df_list
df$sysid: 1
Date sysid power
1 1.1.2018 1 1000
2 2.1.2018 1 1200
3 3.1.2018 1 800
----------------------------------------------------------------------------------------------
df$sysid: 2
Date sysid power
4 1.1.2018 2 1500
5 2.1.2018 2 800
6 3.1.2018 2 1300
Upvotes: 1