James
James

Reputation: 137

Split a df into a list with groups of values withouts nas

I am working with the scale data from GeomLine objects. In this I am trying to find the data for the disjoint lines.

The data might look something like this

structure(list(x = c(96, 150, 150, 157, 163, 165, 165, 165, 166, 
    167, 168, 170, 178, 178, 184, 213), y = c(NA, 49, NA, NA, 
    NA, 75, NA, 45, 50, NA, 55, 56.2, 55, 57, 50, NA)), class = "data.frame", 
    row.names = c(NA, -16L))
#>      x    y
#> 1   96   NA
#> 2  150 49.0
#> 3  150   NA
#> 4  157   NA
#> 5  163   NA
#> 6  165 75.0
#> 7  165   NA
#> 8  165 45.0
#> 9  166 50.0
#> 10 167   NA
#> 11 168 55.0
#> 12 170 56.2
#> 13 178 55.0
#> 14 178 57.0
#> 15 184 50.0
#> 16 213   NA

I need to find the groups of this data that have at least two sequential rows without NAs. Then turn that into a list of df with these values.

I have gotten it to look something like this:

dplyr::filter(tidyr::drop_na(dplyr::mutate(structure(list(x = c(96, 
    150, 150, 157, 163, 165, 165, 165, 166, 167, 168, 170, 178, 
    178, 184, 213), y = c(NA, 49, NA, NA, NA, 75, NA, 45, 50, 
    NA, 55, 56.2, 55, 57, 50, NA)), class = "data.frame", row.names = c(NA, 
    -16L)), row_id = dplyr::row_number()), y), row_id == dplyr::lead(row_id) - 
    1 | row_id == dplyr::lag(row_id) + 1)
#>     x    y row_id
#> 1 165 45.0      8
#> 2 166 50.0      9
#> 3 168 55.0     11
#> 4 170 56.2     12
#> 5 178 55.0     13
#> 6 178 57.0     14
#> 7 184 50.0     15

It is the actual transformation that I am struggling a little bit with.

I can't seem to find a way to do it. I have tried doing more mutations to get it to a point where I could use group_split yet I can't seem to get it to that point.

This is what I would want the end result to look like though.

list(data.frame(x = c(165, 166), y = c(45, 50)), data.frame(x = c(168, 
    170, 178, 178, 184), y = c(55, 56.2, 55, 57, 50)))
#> [[1]]
#>     x  y
#> 1 165 45
#> 2 166 50
#> 
#> [[2]]
#>     x    y
#> 1 168 55.0
#> 2 170 56.2
#> 3 178 55.0
#> 4 178 57.0
#> 5 184 50.0

EDIT:

Thanks to help from answers and in particular @SALAR I have come up with this solution using only dplyr and base R

lineData |> 
          group_by(group = cumsum(is.na(y))) |>
          filter(!(is.na(y) & n() > 1)) |> 
          group_split() |>
          Filter(function(x) nrow(x) >= 2, x = _)

Upvotes: 5

Views: 77

Answers (3)

SAL
SAL

Reputation: 2140

You can do it using tidyverse family:

df %>% 
  # Group the data frame by consecutive NAs in the `y` column
  group_by(group = cumsum(is.na(y))) %>%
  # Remove groups that have more than one NA in the `y` column
  filter(!(is.na(y) & n() > 1)) %>% 
  # Split the data frame into a list of data frames based on the groups
  split(.$group) %>% 
  # Keep only those data frames with more than two rows
  keep(~nrow(.) >= 2)

output:

$`5`
# A tibble: 2 × 3
# Groups:   group [1]
      x     y group
  <dbl> <dbl> <int>
1   165    45     5
2   166    50     5

$`6`
# A tibble: 5 × 3
# Groups:   group [1]
      x     y group
  <dbl> <dbl> <int>
1   168  55       6
2   170  56.2     6
3   178  55       6
4   178  57       6
5   184  50       6

Upvotes: 2

akrun
akrun

Reputation: 887691

Using data.table

library(data.table)
d1 <- setDT(df1)[, if(.N > 1 && all(!is.na(y))) .(x, y), .(grp = rleid(is.na(y)))]
 split(d1[, -1, with = FALSE], d1$grp)

-output

$`6`
     x  y
1: 165 45
2: 166 50

$`8`
     x    y
1: 168 55.0
2: 170 56.2
3: 178 55.0
4: 178 57.0
5: 184 50.0

Upvotes: 1

Allan Cameron
Allan Cameron

Reputation: 174393

You could do

library(dplyr)

df %>% 
  mutate(rle = data.table::rleid(is.na(y))) %>%
  group_by(rle) %>%
  filter(!is.na(y) & n() > 1) %>%
  ungroup() %>%
  split(.$rle) %>%
  lapply(`[`, 1:2)
#> $`6`
#> # A tibble: 2 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1   165    45
#> 2   166    50
#> 
#> $`8`
#> # A tibble: 5 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1   168  55  
#> 2   170  56.2
#> 3   178  55  
#> 4   178  57  
#> 5   184  50

Created on 2023-02-07 with reprex v2.0.2

Upvotes: 1

Related Questions