Reputation: 526
I need to split a large tibble object based on the value of one of the columns. In the example below, I calculate the difference between each subsequent values of column C
. I need to split the tibble into a list of tibbles when the difference exceed a certain value (here D > 2
).
require(dplyr)
TT <- tibble(A = 1:20, B = 21:40, C = c(1:5, 8:11, 18:20, 23:30))
TT <- TT %>%
mutate(D = C - dplyr::lag(C),
E = D > 2)
TT
# A tibble: 20 x 5
A B C D E
<int> <int> <int> <int> <lgl>
1 1 21 1 NA NA
2 2 22 2 1 FALSE
3 3 23 3 1 FALSE
4 4 24 4 1 FALSE
5 5 25 5 1 FALSE
6 6 26 8 3 TRUE
7 7 27 9 1 FALSE
8 8 28 10 1 FALSE
9 9 29 11 1 FALSE
10 10 30 18 7 TRUE
11 11 31 19 1 FALSE
12 12 32 20 1 FALSE
13 13 33 23 3 TRUE
14 14 34 24 1 FALSE
15 15 35 25 1 FALSE
16 16 36 26 1 FALSE
17 17 37 27 1 FALSE
18 18 38 28 1 FALSE
19 19 39 29 1 FALSE
20 20 40 30 1 FALSE
The output should be a list of four tibbles, like this:
# A tibble: 5 x 3
A B C
<int> <int> <int>
1 1 21 1
2 2 22 2
3 3 23 3
4 4 24 4
5 5 25 5
# A tibble: 4 x 3
A B C
<int> <int> <int>
1 6 26 8
2 7 27 9
3 8 28 10
4 9 29 11
# A tibble: 3 x 3
A B C
<int> <int> <int>
1 10 30 18
2 11 31 19
3 12 32 20
# A tibble: 8 x 3
A B C
<int> <int> <int>
1 13 33 23
2 14 34 24
3 15 35 25
4 16 36 26
5 17 37 27
6 18 38 28
7 19 39 29
8 20 40 30
Thanks
Upvotes: 1
Views: 1595
Reputation: 1211
Another option is to use indexing to split on TRUE within an lapply()
, like this:
TT$E[is.na(TT$D)]<-FALSE #Set NA's to FALSE
#index vector of TRUEs, and length
Ts<-c(which(TT$E),length(TT$E)+1)
#lappy with indexing on True
lapply(1:length(Ts), function(x){
A_index <- max(1,Ts[x-1])
B_index <- min(length(TT$E),Ts[x]-1)
TT[A_index:B_index,1:3]) })
Result:
[[1]]
# A tibble: 5 x 3
A B C
<int> <int> <int>
1 1 21 1
2 2 22 2
3 3 23 3
4 4 24 4
5 5 25 5
[[2]]
# A tibble: 4 x 3
A B C
<int> <int> <int>
1 6 26 8
2 7 27 9
3 8 28 10
4 9 29 11
[[3]]
# A tibble: 3 x 3
A B C
<int> <int> <int>
1 10 30 18
2 11 31 19
3 12 32 20
[[4]]
# A tibble: 8 x 3
A B C
<int> <int> <int>
1 13 33 23
2 14 34 24
3 15 35 25
4 16 36 26
5 17 37 27
6 18 38 28
7 19 39 29
8 20 40 30
Upvotes: 0
Reputation: 79338
Another way is to directly use compute the difference within the split function:
split(TT, ~cumsum(c(1, diff(C)) > 1))
$`0`
# A tibble: 5 x 3
A B C
<int> <int> <int>
1 1 21 1
2 2 22 2
3 3 23 3
4 4 24 4
5 5 25 5
$`1`
# A tibble: 4 x 3
A B C
<int> <int> <int>
1 6 26 8
2 7 27 9
3 8 28 10
4 9 29 11
$`2`
# A tibble: 3 x 3
A B C
<int> <int> <int>
1 10 30 18
2 11 31 19
3 12 32 20
$`3`
# A tibble: 8 x 3
A B C
<int> <int> <int>
1 13 33 23
2 14 34 24
3 15 35 25
4 16 36 26
5 17 37 27
6 18 38 28
7 19 39 29
8 20 40 30
Upvotes: 0
Reputation: 21938
We can also do this as an alternative:
library(dplyr)
library(purrr)
TT %>%
mutate(D = C - dplyr::lag(C, default = 0),
E = cumsum(+(D > 2))) %>%
split(.$E) %>%
map(~ .x %>%
select(A:C))
$`0`
# A tibble: 5 x 3
A B C
<int> <int> <int>
1 1 21 1
2 2 22 2
3 3 23 3
4 4 24 4
5 5 25 5
$`1`
# A tibble: 4 x 3
A B C
<int> <int> <int>
1 6 26 8
2 7 27 9
3 8 28 10
4 9 29 11
$`2`
# A tibble: 3 x 3
A B C
<int> <int> <int>
1 10 30 18
2 11 31 19
3 12 32 20
$`3`
# A tibble: 8 x 3
A B C
<int> <int> <int>
1 13 33 23
2 14 34 24
3 15 35 25
4 16 36 26
5 17 37 27
6 18 38 28
7 19 39 29
8 20 40 30
Upvotes: 1
Reputation: 887851
We can create a grouping column with cumsum
, select
the columns of interest and use group_split
to split the dataset into a list
of datasets
library(dplyr)
library(tidyr)
TT %>%
mutate(grp = cumsum(replace_na(E, FALSE))) %>%
select(A:C, grp) %>%
group_split(grp, .keep = FALSE)
-ouptut
[[1]]
# A tibble: 5 x 3
A B C
<int> <int> <int>
1 1 21 1
2 2 22 2
3 3 23 3
4 4 24 4
5 5 25 5
[[2]]
# A tibble: 4 x 3
A B C
<int> <int> <int>
1 6 26 8
2 7 27 9
3 8 28 10
4 9 29 11
[[3]]
# A tibble: 3 x 3
A B C
<int> <int> <int>
1 10 30 18
2 11 31 19
3 12 32 20
[[4]]
# A tibble: 8 x 3
A B C
<int> <int> <int>
1 13 33 23
2 14 34 24
3 15 35 25
4 16 36 26
5 17 37 27
6 18 38 28
7 19 39 29
8 20 40 30
Upvotes: 2