Making subsets of recurring data from a large data set

Question

I am trying to extract certain data from a large data set using R. The data is from a process which goes through several phases, lets say phase 0 to 5. I have a data set which contains multiple runs of the process.

I am trying to extract the data for each of the runs. I want to create subsets with the first occurrence of phase 0 to 5 and then an other subset with again the phases 0 to 5 (of the second run). The data set only contains the process data and the phase number in chronological order, it does not tell which run it is in. However, the phases are in order, so the phase column goes from 0 to 5 and then starts at 0 again.

I have already tried to organize the data using some while and for loops, however this is very slow on such a large data set (700 000 entries). Even using a small section of only 10000 entries or so takes quite long.

the data set may look something like this (second column is the phase):

01, 0, 2, 4, 5, 3, 4,
02, 0, 3, 4, 5, 2, 2,
03, 0, 4, 5, 4, 9, 8,
04, 1, 8, 9, 2, 7, 3,
05, 1, 8, 7, 0, 7, 8,
06, 2, 8, 4, 9, 7, 8,
07, 2, 9, 7, 5, 0, 8,
08, 2, 8, 6, 5, 7, 9,
07, 2, 8, 7, 6, 7, 9,
08, 3, 7, 8, 6, 7, 9,
09, 3, 7, 9, 8, 7, 8, 
10, 4, 5, 6, 7, 4, 3,
11, 4, 6, 7, 5, 6, 4,
12, 5, 6, 4, 3, 2, 2,
13, 0, 6, 3, 3, 2, 5,
14, 0, 5, 6, 3, 2, 2,
15, 1, 5, 2, 1, 4, 4,

note that the number of lines per phase are not constant.

The subsets i would expect from above example would be:

01, 0, 2, 4, 5, 3, 4,
02, 0, 3, 4, 5, 2, 2,
03, 0, 4, 5, 4, 9, 8,
04, 1, 8, 9, 2, 7, 3,
05, 1, 8, 7, 0, 7, 8,
06, 2, 8, 4, 9, 7, 8,
07, 2, 9, 7, 5, 0, 8,
08, 2, 8, 6, 5, 7, 9,
07, 2, 8, 7, 6, 7, 9,
08, 3, 7, 8, 6, 7, 9,
09, 3, 7, 9, 8, 7, 8, 
10, 4, 5, 6, 7, 4, 3,
11, 4, 6, 7, 5, 6, 4,
12, 5, 6, 4, 3, 2, 2,

and

13, 0, 6, 3, 3, 2, 5,
14, 0, 5, 6, 3, 2, 2,
15, 1, 5, 2, 1, 4, 4,

(note: in the original data set the phase will always end at 5)

nsinghphd · Accepted Answer

Here is a vectorized solution

# sample df
df = read.table(text = "
v1  phase   v3  v4  v5  v6  v7
01  0   2   4   5   3   4
02  0   3   4   5   2   2
03  0   4   5   4   9   8
04  1   8   9   2   7   3
05  1   8   7   0   7   8
06  2   8   4   9   7   8
07  2   9   7   5   0   8
08  2   8   6   5   7   9
07  2   8   7   6   7   9
08  3   7   8   6   7   9
09  3   7   9   8   7   8   
10  4   5   6   7   4   3
11  4   6   7   5   6   4
12  5   6   4   3   2   2
13  0   6   3   3   2   5
14  0   5   6   3   2   2
15  1   5   2   1   4   4
                ", stringsAsFactors = F, header = T)

# create groups for splitting df
grp <- cumsum( c( 0, diff( df$phase ) ) < 0 ) + 1

# split the df
split(df, grp)
#> $`1`
#>    v1 phase v3 v4 v5 v6 v7
#> 1   1     0  2  4  5  3  4
#> 2   2     0  3  4  5  2  2
#> 3   3     0  4  5  4  9  8
#> 4   4     1  8  9  2  7  3
#> 5   5     1  8  7  0  7  8
#> 6   6     2  8  4  9  7  8
#> 7   7     2  9  7  5  0  8
#> 8   8     2  8  6  5  7  9
#> 9   7     2  8  7  6  7  9
#> 10  8     3  7  8  6  7  9
#> 11  9     3  7  9  8  7  8
#> 12 10     4  5  6  7  4  3
#> 13 11     4  6  7  5  6  4
#> 14 12     5  6  4  3  2  2
#> 
#> $`2`
#>    v1 phase v3 v4 v5 v6 v7
#> 15 13     0  6  3  3  2  5
#> 16 14     0  5  6  3  2  2
#> 17 15     1  5  2  1  4  4

^{Created on 2019-04-15 by the reprex package (v0.2.1)}

Explanation: diff computes the difference of a specific value from its preceding one, then < operator evaluates if the result is less than 0. The resulting logical vector of T, F is appended with another 0 at the front to keep the vector length same as the number of rows in df. cumsum then finds the cumulative running sum of values, which will only change when there is a smaller phase value right after phase 5, therefore essentially creating groups. Then these groups (grp) are used to split the df.

Making subsets of recurring data from a large data set

Answers (2)

Related Questions