Reputation: 53
I'm working with a DataFrame where column1 is a DateTime type and all other columns are Float64 and represent hourly water use. The DataFrame spans 2 years of water use recorded at hourly intervals for every day and looks something like this.
Date | UserID1 | UserID2 | UserID3 | ...
DateTime | Float64 | Float64 | Float64 | ...
---------------------------------------------------------
2017-01-01 00:00:00 | 1.5 | 22.5 | 5.5 | ...
2017-01-01 01:00:00 | 4.5 | 3.2 | 9.12 | ...
.
.
.
2019-12-31 22:00:00 | 4.2 | 7.6 | 8.9 | ...
2029-12-31 23:00:00 | 3.2 | 0.9 | 11.2 | ...
I'd like to use groupby
to sum
water use for all users over each month.
In Pandas I could easily do something like
df.groupby(df.index.month).sum()
I can't figure out an analogy for Julia DataFrames. Im using CSV, DataFrames, and Dates packages. I can loop through each entry in the Date column like
for i in df.Date
if day(i) == 1
.
.
.
But I'd really like to use groupby
because its such a large DataFrame and I have several other functions I'd like to perform other than sum
Upvotes: 2
Views: 1576
Reputation: 42194
Consider the following DataFrame
:
julia> df = DataFrame(date=Date.(2020,rand(1:5,10), rand(1:28, 10)),val=rand(1:20,10))
10×2 DataFrame
│ Row │ date │ val │
│ │ Date │ Int64 │
├─────┼────────────┼───────┤
│ 1 │ 2020-01-07 │ 17 │
│ 2 │ 2020-04-17 │ 2 │
│ 3 │ 2020-01-18 │ 18 │
│ 4 │ 2020-01-01 │ 11 │
│ 5 │ 2020-04-25 │ 16 │
│ 6 │ 2020-05-08 │ 5 │
│ 7 │ 2020-04-10 │ 4 │
│ 8 │ 2020-02-12 │ 10 │
│ 9 │ 2020-04-16 │ 1 │
│ 10 │ 2020-03-16 │ 15 │
This can be grouped using the following function:
julia> groupby(transform(df, :date => x->yearmonth.(x)),:date_function)
GroupedDataFrame with 5 groups based on key: date_function
First Group (3 rows): date_function = (2020, 1)
│ Row │ date │ val │ date_function │
│ │ Date │ Int64 │ Tuple… │
├─────┼────────────┼───────┼───────────────┤
│ 1 │ 2020-01-07 │ 17 │ (2020, 1) │
│ 2 │ 2020-01-18 │ 18 │ (2020, 1) │
│ 3 │ 2020-01-01 │ 11 │ (2020, 1) │
⋮
Last Group (1 row): date_function = (2020, 3)
│ Row │ date │ val │ date_function │
│ │ Date │ Int64 │ Tuple… │
├─────┼────────────┼───────┼───────────────┤
│ 1 │ 2020-03-16 │ 15 │ (2020, 3) │
or with using Query
:
julia> df |>
@groupby(yearmonth(_.date)) |>
@map({YearMonth=key(_), Sum=sum(_.val)}) |>
DataFrame
5×2 DataFrame
│ Row │ YearMonth │ Sum │
│ │ Tuple… │ Int64 │
├─────┼───────────┼───────┤
│ 1 │ (2020, 1) │ 46 │
│ 2 │ (2020, 4) │ 23 │
│ 3 │ (2020, 5) │ 5 │
│ 4 │ (2020, 2) │ 10 │
│ 5 │ (2020, 3) │ 15 │
Upvotes: 4