Matthijs
Matthijs

Reputation: 133

Making subsets of recurring data from a large data set

I am trying to extract certain data from a large data set using R. The data is from a process which goes through several phases, lets say phase 0 to 5. I have a data set which contains multiple runs of the process.

I am trying to extract the data for each of the runs. I want to create subsets with the first occurrence of phase 0 to 5 and then an other subset with again the phases 0 to 5 (of the second run). The data set only contains the process data and the phase number in chronological order, it does not tell which run it is in. However, the phases are in order, so the phase column goes from 0 to 5 and then starts at 0 again.

I have already tried to organize the data using some while and for loops, however this is very slow on such a large data set (700 000 entries). Even using a small section of only 10000 entries or so takes quite long.

the data set may look something like this (second column is the phase):

01, 0, 2, 4, 5, 3, 4,
02, 0, 3, 4, 5, 2, 2,
03, 0, 4, 5, 4, 9, 8,
04, 1, 8, 9, 2, 7, 3,
05, 1, 8, 7, 0, 7, 8,
06, 2, 8, 4, 9, 7, 8,
07, 2, 9, 7, 5, 0, 8,
08, 2, 8, 6, 5, 7, 9,
07, 2, 8, 7, 6, 7, 9,
08, 3, 7, 8, 6, 7, 9,
09, 3, 7, 9, 8, 7, 8, 
10, 4, 5, 6, 7, 4, 3,
11, 4, 6, 7, 5, 6, 4,
12, 5, 6, 4, 3, 2, 2,
13, 0, 6, 3, 3, 2, 5,
14, 0, 5, 6, 3, 2, 2,
15, 1, 5, 2, 1, 4, 4,

note that the number of lines per phase are not constant.

The subsets i would expect from above example would be:

01, 0, 2, 4, 5, 3, 4,
02, 0, 3, 4, 5, 2, 2,
03, 0, 4, 5, 4, 9, 8,
04, 1, 8, 9, 2, 7, 3,
05, 1, 8, 7, 0, 7, 8,
06, 2, 8, 4, 9, 7, 8,
07, 2, 9, 7, 5, 0, 8,
08, 2, 8, 6, 5, 7, 9,
07, 2, 8, 7, 6, 7, 9,
08, 3, 7, 8, 6, 7, 9,
09, 3, 7, 9, 8, 7, 8, 
10, 4, 5, 6, 7, 4, 3,
11, 4, 6, 7, 5, 6, 4,
12, 5, 6, 4, 3, 2, 2,

and

13, 0, 6, 3, 3, 2, 5,
14, 0, 5, 6, 3, 2, 2,
15, 1, 5, 2, 1, 4, 4,

(note: in the original data set the phase will always end at 5)

Upvotes: 2

Views: 57

Answers (2)

nsinghphd
nsinghphd

Reputation: 2022

Here is a vectorized solution

# sample df
df = read.table(text = "
v1  phase   v3  v4  v5  v6  v7
01  0   2   4   5   3   4
02  0   3   4   5   2   2
03  0   4   5   4   9   8
04  1   8   9   2   7   3
05  1   8   7   0   7   8
06  2   8   4   9   7   8
07  2   9   7   5   0   8
08  2   8   6   5   7   9
07  2   8   7   6   7   9
08  3   7   8   6   7   9
09  3   7   9   8   7   8   
10  4   5   6   7   4   3
11  4   6   7   5   6   4
12  5   6   4   3   2   2
13  0   6   3   3   2   5
14  0   5   6   3   2   2
15  1   5   2   1   4   4
                ", stringsAsFactors = F, header = T)

# create groups for splitting df
grp <- cumsum( c( 0, diff( df$phase ) ) < 0 ) + 1

# split the df
split(df, grp)
#> $`1`
#>    v1 phase v3 v4 v5 v6 v7
#> 1   1     0  2  4  5  3  4
#> 2   2     0  3  4  5  2  2
#> 3   3     0  4  5  4  9  8
#> 4   4     1  8  9  2  7  3
#> 5   5     1  8  7  0  7  8
#> 6   6     2  8  4  9  7  8
#> 7   7     2  9  7  5  0  8
#> 8   8     2  8  6  5  7  9
#> 9   7     2  8  7  6  7  9
#> 10  8     3  7  8  6  7  9
#> 11  9     3  7  9  8  7  8
#> 12 10     4  5  6  7  4  3
#> 13 11     4  6  7  5  6  4
#> 14 12     5  6  4  3  2  2
#> 
#> $`2`
#>    v1 phase v3 v4 v5 v6 v7
#> 15 13     0  6  3  3  2  5
#> 16 14     0  5  6  3  2  2
#> 17 15     1  5  2  1  4  4

Created on 2019-04-15 by the reprex package (v0.2.1)

Explanation: diff computes the difference of a specific value from its preceding one, then < operator evaluates if the result is less than 0. The resulting logical vector of T, F is appended with another 0 at the front to keep the vector length same as the number of rows in df. cumsum then finds the cumulative running sum of values, which will only change when there is a smaller phase value right after phase 5, therefore essentially creating groups. Then these groups (grp) are used to split the df.

Upvotes: 1

Michiel uit het Broek
Michiel uit het Broek

Reputation: 993

In your problem, a new subset only starts if the phase index (given in the second column) decreases. Thus we can combine which with diff to get the row indices at which a phase ends.

We first can obtain the row indices at which a subset starts/ends as follows:

end   <- which(diff(data[,2]) < 0)
start <- c(1, end + 1)
end   <- c(end, nrow(data))

Then we can use this to make a list with the subsets as follows:

subsets <- vector("list", length(start))

for (idx in 1:length(start))
  subsets[[idx]] <- data[start[idx]:end[idx],]

Note that this last step still uses a for-loop. I expect that something similar can be achieved with split but I did not succeed in trying.

EDIT: Of course R has a solution that avoids the for-statement. We should create a new vector that indicates for each row to which subset it belongs. Given the above two vectors end and start this can easily be achieved (see code snippet below). Then we can use the built-in function split and transform its output to a list of matrices (inspired on this answer).

# How many rows does each subset has?
n <- end - start + 1

# Create vector that indicates for each row to which subset it belongs 
idx_subset <- rep(1:length(start), n)

# Create the subsets
subsets <- lapply( split(data, idx_subset), matrix, ncol=7)

Upvotes: 2

Related Questions