Ahmed El-Gabbas
Ahmed El-Gabbas

Reputation: 526

Split dataframe based on value of one column in R

I need to split a large tibble object based on the value of one of the columns. In the example below, I calculate the difference between each subsequent values of column C. I need to split the tibble into a list of tibbles when the difference exceed a certain value (here D > 2).

require(dplyr)
TT <- tibble(A = 1:20, B = 21:40, C = c(1:5, 8:11, 18:20, 23:30))
TT <- TT %>% 
  mutate(D = C - dplyr::lag(C),
         E = D > 2)
TT
# A tibble: 20 x 5
       A     B     C     D E    
   <int> <int> <int> <int> <lgl>
 1     1    21     1    NA NA   
 2     2    22     2     1 FALSE
 3     3    23     3     1 FALSE
 4     4    24     4     1 FALSE
 5     5    25     5     1 FALSE
 6     6    26     8     3 TRUE 
 7     7    27     9     1 FALSE
 8     8    28    10     1 FALSE
 9     9    29    11     1 FALSE
10    10    30    18     7 TRUE 
11    11    31    19     1 FALSE
12    12    32    20     1 FALSE
13    13    33    23     3 TRUE 
14    14    34    24     1 FALSE
15    15    35    25     1 FALSE
16    16    36    26     1 FALSE
17    17    37    27     1 FALSE
18    18    38    28     1 FALSE
19    19    39    29     1 FALSE
20    20    40    30     1 FALSE

The output should be a list of four tibbles, like this:

# A tibble: 5 x 3
      A     B     C
  <int> <int> <int>
1     1    21     1
2     2    22     2
3     3    23     3
4     4    24     4
5     5    25     5

# A tibble: 4 x 3
      A     B     C
  <int> <int> <int>
1     6    26     8
2     7    27     9
3     8    28    10
4     9    29    11

# A tibble: 3 x 3
      A     B     C
  <int> <int> <int>
1    10    30    18
2    11    31    19
3    12    32    20

# A tibble: 8 x 3
      A     B     C
  <int> <int> <int>
1    13    33    23
2    14    34    24
3    15    35    25
4    16    36    26
5    17    37    27
6    18    38    28
7    19    39    29
8    20    40    30

Thanks

Upvotes: 1

Views: 1595

Answers (4)

SEAnalyst
SEAnalyst

Reputation: 1211

Another option is to use indexing to split on TRUE within an lapply(), like this:

TT$E[is.na(TT$D)]<-FALSE #Set NA's to FALSE

#index vector of TRUEs, and length
Ts<-c(which(TT$E),length(TT$E)+1) 

#lappy with indexing on True
lapply(1:length(Ts), function(x){
  A_index <- max(1,Ts[x-1])
  B_index <- min(length(TT$E),Ts[x]-1)
  TT[A_index:B_index,1:3])      })

Result:

     [[1]]
# A tibble: 5 x 3
      A     B     C
  <int> <int> <int>
1     1    21     1
2     2    22     2
3     3    23     3
4     4    24     4
5     5    25     5

[[2]]
# A tibble: 4 x 3
      A     B     C
  <int> <int> <int>
1     6    26     8
2     7    27     9
3     8    28    10
4     9    29    11

[[3]]
# A tibble: 3 x 3
      A     B     C
  <int> <int> <int>
1    10    30    18
2    11    31    19
3    12    32    20

[[4]]
# A tibble: 8 x 3
      A     B     C
  <int> <int> <int>
1    13    33    23
2    14    34    24
3    15    35    25
4    16    36    26
5    17    37    27
6    18    38    28
7    19    39    29
8    20    40    30

Upvotes: 0

Onyambu
Onyambu

Reputation: 79338

Another way is to directly use compute the difference within the split function:

split(TT, ~cumsum(c(1, diff(C)) > 1))
$`0`
# A tibble: 5 x 3
      A     B     C
  <int> <int> <int>
1     1    21     1
2     2    22     2
3     3    23     3
4     4    24     4
5     5    25     5

$`1`
# A tibble: 4 x 3
      A     B     C
  <int> <int> <int>
1     6    26     8
2     7    27     9
3     8    28    10
4     9    29    11

$`2`
# A tibble: 3 x 3
      A     B     C
  <int> <int> <int>
1    10    30    18
2    11    31    19
3    12    32    20

$`3`
# A tibble: 8 x 3
      A     B     C
  <int> <int> <int>
1    13    33    23
2    14    34    24
3    15    35    25
4    16    36    26
5    17    37    27
6    18    38    28
7    19    39    29
8    20    40    30

Upvotes: 0

Anoushiravan R
Anoushiravan R

Reputation: 21938

We can also do this as an alternative:

library(dplyr)
library(purrr)

TT %>% 
  mutate(D = C - dplyr::lag(C, default = 0),
         E = cumsum(+(D > 2))) %>%
  split(.$E) %>%
  map(~ .x %>%
        select(A:C))

$`0`
# A tibble: 5 x 3
      A     B     C
  <int> <int> <int>
1     1    21     1
2     2    22     2
3     3    23     3
4     4    24     4
5     5    25     5

$`1`
# A tibble: 4 x 3
      A     B     C
  <int> <int> <int>
1     6    26     8
2     7    27     9
3     8    28    10
4     9    29    11

$`2`
# A tibble: 3 x 3
      A     B     C
  <int> <int> <int>
1    10    30    18
2    11    31    19
3    12    32    20

$`3`
# A tibble: 8 x 3
      A     B     C
  <int> <int> <int>
1    13    33    23
2    14    34    24
3    15    35    25
4    16    36    26
5    17    37    27
6    18    38    28
7    19    39    29
8    20    40    30

Upvotes: 1

akrun
akrun

Reputation: 887851

We can create a grouping column with cumsum, select the columns of interest and use group_split to split the dataset into a list of datasets

library(dplyr)
library(tidyr)
TT %>%
     mutate(grp = cumsum(replace_na(E, FALSE))) %>% 
     select(A:C, grp) %>%
     group_split(grp, .keep = FALSE)

-ouptut

[[1]]
# A tibble: 5 x 3
      A     B     C
  <int> <int> <int>
1     1    21     1
2     2    22     2
3     3    23     3
4     4    24     4
5     5    25     5

[[2]]
# A tibble: 4 x 3
      A     B     C
  <int> <int> <int>
1     6    26     8
2     7    27     9
3     8    28    10
4     9    29    11

[[3]]
# A tibble: 3 x 3
      A     B     C
  <int> <int> <int>
1    10    30    18
2    11    31    19
3    12    32    20

[[4]]
# A tibble: 8 x 3
      A     B     C
  <int> <int> <int>
1    13    33    23
2    14    34    24
3    15    35    25
4    16    36    26
5    17    37    27
6    18    38    28
7    19    39    29
8    20    40    30

Upvotes: 2

Related Questions