Reputation: 1076
Suppose I have the follow DataFrame:
julia> Random.seed!(1)
TaskLocalRNG()
julia> df = DataFrame(data = rand(1:10, 10), gr = rand([0, 1], 10))
10×2 DataFrame
Row │ data gr
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 4 0
3 │ 7 0
4 │ 7 0
5 │ 10 1
6 │ 2 1
7 │ 8 0
8 │ 8 0
9 │ 7 0
10 │ 2 0
What I want is something not only the value of the :gr
, but also the occurrences of these values. In this case, the number of groups should be 4:
Group 1 (1 row)
Row │ data gr
│ Int64 Int64
─────┼──────────────
1 │ 1 1
Group 2 (3 rows)
Row │ data gr
│ Int64 Int64
─────┼──────────────
2 │ 4 0
3 │ 7 0
4 │ 7 0
Group 3 (2 rows)
Row │ data gr
│ Int64 Int64
─────┼──────────────
5 │ 10 1
6 │ 2 1
Group 4 (4 rows)
Row │ data gr
│ Int64 Int64
─────┼──────────────
7 │ 8 0
8 │ 8 0
9 │ 7 0
10 │ 2 0
If I group by the column :gr
, however, I could only get two groups:
julia> groupby(df, :gr)
GroupedDataFrame with 2 groups based on key: gr
First Group (7 rows): gr = 0
Row │ data gr
│ Int64 Int64
─────┼──────────────
1 │ 4 0
2 │ 7 0
3 │ 7 0
4 │ 8 0
5 │ 8 0
6 │ 7 0
7 │ 2 0
⋮
Last Group (3 rows): gr = 1
Row │ data gr
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 10 1
3 │ 2 1
How could I implement this in Julia DataFrames.jl
? Thanks
versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin19.5.0)
CPU: Apple M1 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, westmere)
Environment:
JULIA_NUM_THREADS = 1
JULIA_EDITOR = code
Upvotes: 3
Views: 307
Reputation: 14735
Same basic idea as @GEK's answer, with a more vectorized implementation:
julia> edgedetect(col) = [0; abs.(diff(col))] |> cumsum
edgedetect (generic function with 1 method)
julia> edgedetect([0, 1, 1, 1, 0, 0, 1]) |> print
[0, 1, 1, 1, 2, 2, 3]
abs.(diff(col))
places a 1 wherever the value of column col
changes, and a 0 elsewhere. (diff
returns n-1 differences given n elements, so we prefix the result with a 0 to maintain column length.) Doing a cumulative sum on this, we get a new column that increases every time the value in the original column changes.
We can then use this function to groupby
on a transformed dataframe, like this:
julia> groupby(transform(df, :gr => edgedetect => :gr_edges, copycols = false), :gr_edges) |> print
GroupedDataFrame with 4 groups based on key: gr_edges
Group 1 (1 row): gr_edges = 0
Row │ data gr gr_edges
│ Int64 Int64 Int64
─────┼────────────────────────
1 │ 1 1 0
Group 2 (3 rows): gr_edges = 1
Row │ data gr gr_edges
│ Int64 Int64 Int64
─────┼────────────────────────
1 │ 4 0 1
2 │ 7 0 1
3 │ 7 0 1
Group 3 (2 rows): gr_edges = 2
Row │ data gr gr_edges
│ Int64 Int64 Int64
─────┼────────────────────────
1 │ 10 1 2
2 │ 2 1 2
Group 4 (4 rows): gr_edges = 3
Row │ data gr gr_edges
│ Int64 Int64 Int64
─────┼────────────────────────
1 │ 8 0 3
2 │ 8 0 3
3 │ 7 0 3
4 │ 2 0 3
Upvotes: 1
Reputation: 59
You can group by a new variable that increments each time :gr
changes value.
For example:
nrow = size(df,1)
gr1 = zeros(Int64, nrow)
for i=2:nrow
gr1[i] = gr1[i-1] + (df.gr[i] != df.gr[i-1])
end
df.gr1 = gr1
Upvotes: 3