user1667477

Reputation: 91

How to split a data frame into multiple data frames using a conditional statement in R

I have data that looks like this:

time <- c(1:20)
temp <- c(2,3,4,5,6,2,3,4,5,6,2,3,4,5,6,2,3,4,5,6)
data <- data.frame(time,temp)

This is a very basic representation of my data. If you plot it, you can easily see that there are 4 up-sloping groups of data. I want to split the original data frame into these 4 "subsets" so that I can run calculations on them, such as mean, max, min and standard deviation. I'd like to use split(), but it only splits based on factor levels. I'd like to be able to feed split() a conditional statement, such as split if: diff(data$temp) > -2.

My problem is actually much more complex than this, but is there a function like split() that will let me create new data frames based on a conditional statement, as opposed to splitting based on factor levels?

Thanks all!

Upvotes: 5

Views: 3188

Answers (2)

Rcoster

Reputation: 3210

If your data isn't that well behaved, you can use cut() to create the categorical variable. The only 'problem' is that it's 100% manual: you have to pick the break points yourself.

time <- c(1:200)
temp <- (time %% 51) * (-1)^(time %/% 51) + rnorm(200)
data <- data.frame(time,temp) 

layout(matrix(c(1, 1, 2, 2, 3, 4, 5, 6), nrow=2))
plot(data, main='All data')

time2 <- cut(time, c(0, 50, 101, 152, 200))
plot(data, col=time2, main='All data, by time2')
data2 <- split(data, time2)

for (i in 1:4) {
 plot(data2[[i]], main=names(data2)[i])
}

EDIT:

Now a 100% automatic process:

time <- c(1:200)
temp <- (time %% 51) * (-1)^(time %/% 51) + rnorm(200)
data <- data.frame(time,temp) 

layout(matrix(c(1, 1, 2, 2, 3, 4, 5, 6), nrow=2))
plot(data, main='All data')


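tol <- 10 # Minimum absolute jump in temp between consecutive points that counts as a structural break
# which() gives the index of the last point before each jump; cut()'s default right-closed intervals keep that point in the earlier segment
time2 <- cut(time, c(0, which(abs(diff(data$temp)) >= tol), max(time)))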

plot(data, col=time2, main='All data, by time2')
data2 <- split(data, time2)

for (i in seq_along(data2)) { # the number of segments is no longer fixed at 4
 plot(data2[[i]], main=names(data2)[i])
}
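
One way to sanity-check the automatic breaks is to run the same idea on noise-free data, where the break positions are known in advance. This is only a sketch under that assumption; the names `breaks` and `grp` are illustrative and not part of the code above:

```r
time <- 1:200
temp <- (time %% 51) * (-1)^(time %/% 51)   # same signal, without the rnorm() noise
tol <- 10

# which() returns the index of the last point before each jump
breaks <- c(0, which(abs(diff(temp)) >= tol), max(time))
print(breaks)

# Default right-closed intervals (a, b] keep that last point in the earlier segment
grp <- cut(time, breaks)
print(table(grp))
```

On this clean signal the detected breaks are 0, 50, 101, 152 and 200, the same values used in the manual cut() call earlier, and every time point lands in a segment.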

Upvotes: 0

Blue Magister

Reputation: 13363

The trick is to convert your conditional statement into something that can be construed as a factor. In this particular example:

tmp <- c(1,diff(data[[2]]))
#  [1]  1  1  1  1  1 -4  1  1  1  1 -4  1  1  1  1 -4  1  1  1  1
tmp2 <- tmp < 0
# [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
# [13] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
tmp3 <- cumsum(tmp2)
#  [1] 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
split(data, tmp3)
# $`0`
#   time temp
# 1    1    2
# 2    2    3
# 3    3    4
# 4    4    5
# 5    5    6
# 
# $`1`
#    time temp
# 6     6    2
# 7     7    3
# 8     8    4
# 9     9    5
# 10   10    6
# 
# $`2`
#    time temp
# 11   11    2
# 12   12    3
# 13   13    4
# 14   14    5
# 15   15    6
# 
# $`3`
#    time temp
# 16   16    2
# 17   17    3
# 18   18    4
# 19   19    5
# 20   20    6
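
To get the per-group statistics the question asks for, the list returned by split() can be passed to sapply(). A minimal sketch using the example data from the question; the names `grp`, `pieces` and `stats` are introduced here purely for illustration:

```r
# Example data from the question
time <- 1:20
temp <- rep(2:6, times = 4)
data <- data.frame(time, temp)

# Start a new group wherever temp drops relative to the previous row
grp <- cumsum(c(FALSE, diff(data$temp) < 0))
pieces <- split(data, grp)

# One column per group, one row per statistic
stats <- sapply(pieces, function(d) {
  c(mean = mean(d$temp), max = max(d$temp), min = min(d$temp), sd = sd(d$temp))
})
print(stats)
```

If the separate data frames are not needed, the same grouping vector also works directly with tapply(), for example tapply(data$temp, grp, mean).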

Upvotes: 4
