Group DataFrame by sequential occurrence of values in a column in Julia

Question

Suppose I have the follow DataFrame:

julia> Random.seed!(1)
TaskLocalRNG()

julia> df = DataFrame(data = rand(1:10, 10), gr = rand([0, 1], 10))
10×2 DataFrame
 Row │ data   gr    
     │ Int64  Int64 
─────┼──────────────
   1 │     1      1
   2 │     4      0
   3 │     7      0
   4 │     7      0
   5 │    10      1
   6 │     2      1
   7 │     8      0
   8 │     8      0
   9 │     7      0
  10 │     2      0

What I want is something not only the value of the :gr, but also the occurrences of these values. In this case, the number of groups should be 4:

Group 1 (1 row)
 Row │ data   gr    
     │ Int64  Int64 
─────┼──────────────
   1 │     1      1


Group 2 (3 rows)
 Row │ data   gr    
     │ Int64  Int64 
─────┼──────────────
   2 │     4      0
   3 │     7      0
   4 │     7      0

Group 3 (2 rows)
 Row │ data   gr    
     │ Int64  Int64 
─────┼──────────────
   5 │    10      1
   6 │     2      1

Group 4 (4 rows)
 Row │ data   gr    
     │ Int64  Int64 
─────┼──────────────
   7 │     8      0
   8 │     8      0
   9 │     7      0
  10 │     2      0

If I group by the column :gr, however, I could only get two groups:

julia> groupby(df, :gr)
GroupedDataFrame with 2 groups based on key: gr
First Group (7 rows): gr = 0
 Row │ data   gr    
     │ Int64  Int64 
─────┼──────────────
   1 │     4      0
   2 │     7      0
   3 │     7      0
   4 │     8      0
   5 │     8      0
   6 │     7      0
   7 │     2      0
⋮
Last Group (3 rows): gr = 1
 Row │ data   gr    
     │ Int64  Int64 
─────┼──────────────
   1 │     1      1
   2 │    10      1
   3 │     2      1

How could I implement this in Julia DataFrames.jl? Thanks

versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Apple M1 Max
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, westmere)
Environment:
  JULIA_NUM_THREADS = 1
  JULIA_EDITOR = code

Sundar R · Accepted Answer

Same basic idea as @GEK's answer, with a more vectorized implementation:

julia> edgedetect(col) = [0; abs.(diff(col))] |> cumsum
edgedetect (generic function with 1 method)

julia> edgedetect([0, 1, 1, 1, 0, 0, 1]) |> print 
[0, 1, 1, 1, 2, 2, 3]

abs.(diff(col)) places a 1 wherever the value of column col changes, and a 0 elsewhere. (diff returns n-1 differences given n elements, so we prefix the result with a 0 to maintain column length.) Doing a cumulative sum on this, we get a new column that increases every time the value in the original column changes.

We can then use this function to groupby on a transformed dataframe, like this:

julia> groupby(transform(df, :gr => edgedetect => :gr_edges, copycols = false), :gr_edges) |> print
GroupedDataFrame with 4 groups based on key: gr_edges
Group 1 (1 row): gr_edges = 0
 Row │ data   gr     gr_edges 
     │ Int64  Int64  Int64    
─────┼────────────────────────
   1 │     1      1         0
Group 2 (3 rows): gr_edges = 1
 Row │ data   gr     gr_edges 
     │ Int64  Int64  Int64    
─────┼────────────────────────
   1 │     4      0         1
   2 │     7      0         1
   3 │     7      0         1
Group 3 (2 rows): gr_edges = 2
 Row │ data   gr     gr_edges 
     │ Int64  Int64  Int64    
─────┼────────────────────────
   1 │    10      1         2
   2 │     2      1         2
Group 4 (4 rows): gr_edges = 3
 Row │ data   gr     gr_edges 
     │ Int64  Int64  Int64    
─────┼────────────────────────
   1 │     8      0         3
   2 │     8      0         3
   3 │     7      0         3
   4 │     2      0         3

Group DataFrame by sequential occurrence of values in a column in Julia

Answers (2)

Related Questions