Reputation: 47

Subsetting a data frame leads to an unexpected behaviour

I'm quite new to R and don't know whether this question is stupid or not, but I have the following problem: I have two data frames, both containing a timestamp column (Posixct). I want to subset the data as follows: The new df3 shall contain all columns from df1, conditioned on the occurrence of the value "21" in the "State" column in df2. If this occurs, all observation from df1 from 2 hours before until 1 hour after the occurrence of State 21 in df2 shall be in the new df3.

here are two samples of my original data frames.

df2:

    timestamp           PlantNo State
37 2016-03-14 08:53:25       1     2
38 2016-03-14 09:31:43       1     0
39 2016-03-14 09:34:43       1     0
40 2016-03-14 13:49:18       1     2
41 2016-03-14 14:17:42       1     0
42 2016-03-14 14:20:41       1     0
43 2016-03-17 01:54:07       1     2
44 2016-03-17 07:06:23       1     0
45 2016-03-17 07:09:23       1     0
46 2016-03-17 10:10:11       1    21
47 2016-03-17 10:46:29       1     2

df1:

       timestamp        PlantNo Error avws maxws minws avrot maxrot minrot avpwr
1  2016-03-05 00:00:00       1     0  6.7   9.3   4.3 25.15  30.96  21.37    93
2  2016-03-05 00:10:00       1     0  7.9  11.1   5.5 29.05  34.26  22.89   145
3  2016-03-05 00:20:00       1     0  7.8  10.5   4.6 28.83  33.17  23.34   142
4  2016-03-05 00:30:00       1     0  7.8  10.4   5.3 28.91  33.94  24.09   142
5  2016-03-05 00:40:00       1     0  7.9  10.1   5.4 28.36  32.89  23.03   134
6  2016-03-05 00:50:00       1     0  7.8   9.2   5.4 27.83  31.41  22.48   126
7  2016-03-05 01:00:00       1     0  7.8   9.0   6.6 28.06  30.21  25.87   129
8  2016-03-05 01:10:00       1     0  7.2   9.0   4.7 24.91  29.23   9.92    91
9  2016-03-05 01:20:00       1     0  7.0   8.7   3.9 25.04  28.53  20.39    92
10 2016-03-05 01:30:00       1     0  7.2   9.1   3.7 26.04  30.39  20.75   103
11 2016-03-05 01:40:00       1     0  7.5   9.3   5.1 26.75  30.98  21.80   111
12 2016-03-05 01:50:01       1     0  7.5   8.7   4.9 26.94  28.96  22.46   114
13 2016-03-05 02:00:00       1     0  7.1   9.0   4.4 25.32  30.24  20.38    95
14 2016-03-05 02:10:00       1     0  7.5   9.0   5.0 26.47  29.35  22.85   108
15 2016-03-05 02:20:00       1     0  7.3   9.1   4.2 26.03  30.97  19.43   104
16 2016-03-05 02:30:00       1     0  6.7  10.1   3.9 24.66  30.98  20.06    88
17 2016-03-05 02:40:00       1     0  6.8   9.1   4.5 25.30  30.22  20.88    94
18 2016-03-05 02:50:00       1     0  7.2  10.1   4.2 25.95  31.17  20.74   103
19 2016-03-05 03:00:00       1     0  7.6  10.3   4.3 27.72  34.43  22.19   127
20 2016-03-05 03:10:00       1     0  8.7  12.2   5.3 31.42  35.69  25.32   189

I tried the following code:

df3 <- subset(df1, df1$timestamp > df2$timestamp[df2$State==21]-7200 & 
                   df1$timestamp < df2$timestamp[df2$State==21]+3600)

Basically this works for me, as the right timeframes are selected and saved in df3, but: only every fourth observation of df1 is going into df3. This is what I get in the end:

          timestamp      PlantNo Error avws maxws minws avrot maxrot minrot avpwr
1781 2016-03-17 08:30:00       1     0  2.2   2.7   1.6 14.57  15.85  13.52     0
1785 2016-03-17 09:10:00       1     0  1.7   2.4   0.7 10.43  13.48   8.71     0
1789 2016-03-17 09:50:00       1     0  1.9   2.9   0.7 11.62  15.91   6.86     0
1793 2016-03-17 10:30:00       1     0  2.4   4.3   0.6  0.27   1.59   0.00     0
1797 2016-03-17 11:10:00       1     0  2.7   4.2   1.7 16.38  18.76  13.17     0
3006 2016-03-25 22:40:00       1     0  5.4   6.9   4.1 19.99  21.95  19.21    41
3010 2016-03-25 23:20:00       1     0  6.0   7.1   4.6 21.43  24.59  19.59    56
3014 2016-03-26 00:00:00       1     0  5.1   6.5   4.0 19.41  20.33  18.90    30
3018 2016-03-26 00:40:00       1     0  5.2   6.6   3.0  4.06  20.82   0.00     4
3022 2016-03-26 01:20:00       1     0  5.2   6.4   3.7 19.52  20.26  18.75    33
3583 2016-03-29 23:40:00       1     0  5.7   6.8   4.9 20.57  22.80  19.60    48
3587 2016-03-30 00:20:00       1     0  6.4   7.5   5.4 22.82  25.10  20.27    68
3591 2016-03-30 01:00:00       1     0  6.1   7.4   5.2 21.94  23.99  20.52    60
3595 2016-03-30 01:40:00       1     0  4.0   5.0   2.4  2.90  18.99   0.00     1
3599 2016-03-30 02:20:00       1     0  5.4   6.2   4.5 19.63  20.05  19.26    36
3812 2016-03-31 13:50:00       1     0  2.2   4.3   0.8 13.95  18.30   7.89     1
3816 2016-03-31 14:30:00       1     0  0.8   2.1   0.0  0.00   0.10   0.00     0
3820 2016-03-31 15:10:00       1     0  0.6   1.4   0.0  0.00   0.00   0.00     0
3824 2016-03-31 15:50:00       1     0  2.5   4.4   0.6 11.54  19.12   0.00     1
3828 2016-03-31 16:30:00       1     0  2.8   4.3   1.7  5.31  18.39   0.00     1

The leftmost columns is the number of the observation in df1. Can someone tell me what I'm doing wrong?

Upvotes: 2

Answers (3)

Alex

Reputation: 4995

I am not sure if I am thinking to complicated but in my opinion subset should not work so easily.

I built an example of two data.frames with two random timestamps which are in the same range. I used an ugly nested for loop to check whether each of the timestamps in df2 is in the range of +- 1 hour of all the timestamps with state = 1 of df1. If so, state = 1 is added to df2. For the desired result you can easily subset df2.

set.seed(1)
t1   <- sort(as.POSIXct(sample(1:10000000, 1000) ,origin = "2010-01-01"))
state <- sample(0:1,10000, replace = TRUE, prob = c(0.90,0.05) )
df1 <- data.frame(t1, state)

t2 <- sort(as.POSIXct(sample(1:10000000, 1000) ,origin = "2010-01-01" ))
df2 <- data.frame(t2, ID = 1:length(t2))

df1_h <- df1[df1$state == 1, ]

df2$state <- NA
for (i in 1:nrow(df2)){
  for(j in 1:nrow(df1_h)){
    if(df2$t2[i] > df1_h$t1[j] - 3600 &  df2$t2[i] < df1_h$t1[j] + 3600) df2$state[i] <- 1
  }
} 

df3 <- df2[df2$state == 1, ]

Upvotes: 0

Robert

Reputation: 5162

Maybe your comparison df1$timestamp>(df2$timestamp[df2$State==21]-7200) is recycling. This is the case when df2$State==21 returns more than one case.

Try the following, as we don't have enough data I am not sure if works for you.

inf=df2$timestamp[df2$State==21]-7200
sup=df2$timestamp[df2$State==21]+3600

trs=list()
for (i in 1:length(inf))trs[[i]]=df1$timestamp>inf[i] & df1$timestamp<sup[i]
selv=apply(t(do.call("rbind",trs)),1,any)
df31 <- subset(df1,selv)

Upvotes: 1

Christoph

Reputation: 7063

Just a guess: Perhaps you need subset(df1, df1$timestamp > (df2$timestamp[df2$State==21]-7200) & df1$timestamp < (df2$timestamp[df2$State==21]+3600))?

Upvotes: 0

Subsetting a data frame leads to an unexpected behaviour

Answers (3)

Related Questions