Reputation: 1405
I have a Julia DataFrame like so:
│ Row │ date │ users │ posts │ topics │ likes │ pageviews │
│ │ Date │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├──────┼────────────────────┼───────┼───────┼────────┼───────┼───────────┤
│ 1 │ Date("2020-06-16") │ 1 │ 3 │ 4 │ 7 │ 10000 │
│ 2 │ Date("2020-06-15") │ 2 │ 2 │ 5 │ 8 │ 20000 │
│ 3 │ Date("2020-06-14") │ 3 │ 3 │ 6 │ 9 │ 30000 │
I would like a plot of posts vs date, but the daily data is too noisy, so I'd like to sum the posts for each week and plot that instead. What's the easiest way to achieve that?
Upvotes: 6
Views: 602
Reputation: 42194
Here is my proposal. It lets you control which day starts the week (I use Monday here, since 2000-01-03 was a Monday) and is robust to missing data; dates can also be repeated in the dataset:
df.weekno = Dates.days.(df.date .- Date(2000, 1, 3)) .÷ 7
combine(groupby(df, :weekno), :val => sum)
Note that in the benchmark below this is roughly 3.4x faster than flooring the dates with Week:
julia> @btime transform($df,:date => (d->floor.(d,Week)) => :week);
63.699 μs (68 allocations: 175.47 KiB)
julia> @btime transform($df,:date => (d->Dates.days.(d .- Date(2000, 1, 3)) .÷ 7) => :week);
18.900 μs (74 allocations: 175.64 KiB)
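One caveat with the integer-division week index above: `÷` truncates toward zero, so dates before the reference Monday fall into the wrong bucket (2000-01-02 ends up in week 0 together with 2000-01-03). A minimal sketch that uses `fld` instead, assuming the same reference date; the names `REF_MONDAY`, `weekindex`, and `weekstart` are hypothetical helpers introduced only for illustration:

```julia
using Dates

# Reference Monday taken from the snippet above; any Monday works.
const REF_MONDAY = Date(2000, 1, 3)

# Hypothetical helper: week index relative to REF_MONDAY.
# fld rounds toward -Inf, so dates before the reference get negative indices
# instead of being merged into week 0.
weekindex(d::Date) = fld(Dates.value(d - REF_MONDAY), 7)

# Hypothetical helper: the Monday on which week number `w` starts,
# handy for labelling the x axis of a plot.
weekstart(w::Integer) = REF_MONDAY + Day(7w)
```

With these, `weekstart(weekindex(d))` gives the Monday of the week containing `d`, which can be used as the grouping column in place of the bare week number.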
Upvotes: 2
Reputation: 305
In DataFrames you can use groupby with combine as follows:
julia> using DataFrames, Statistics, Dates, Pipe
julia> df = DataFrame(date = range(Date(2000, 01, 01), Date(2020, 01, 01), step = Day(1)));
julia> df.val = rand(nrow(df));
julia> @pipe df |>
           transform(_,
               :date => ByRow(year) => :year,
               :date => ByRow(week) => :week  # ISO week number, 1:53
           ) |>
           groupby(_, [:week, :year]) |>
           combine(_, :val => mean)  # one row per (week, year) group
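One thing to watch when grouping by `[:week, :year]`: `week` returns the ISO week number while `year` returns the calendar year, so days around New Year can end up in the wrong group (2000-01-01 has week 52 and year 2000, the same key as the last week of December 2000). A sketch of an alternative that groups by the Monday of each week via `floor(date, Week)` and sums the `posts` column from the question; the sample data is made up for illustration:

```julia
using DataFrames, Dates

# Small made-up sample spanning a year boundary (illustrative data only).
df = DataFrame(date = Date(2019, 12, 28):Day(1):Date(2020, 1, 8))
df.posts = ones(Int, nrow(df))

# floor(d, Week) returns the Monday of the week containing d,
# so each week gets a unique key even across a year boundary.
df.week = floor.(df.date, Week)

# One row per week, with the posts summed.
weekly = combine(groupby(df, :week), :posts => sum => :posts)
```

Here `weekly.week` holds the Mondays 2019-12-23, 2019-12-30 and 2020-01-06, and `weekly.posts` is `[2, 7, 3]`, so the week that straddles New Year stays in one piece.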
If you want a moving average, you can use the following function:
julia> function lagged_mean(x, b)
           map(1:length(x)) do i
               i < b ? missing : mean(@view x[i-b+1:i])
           end
       end
julia> lagged_mean(df.val, 7)
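To make the behaviour at the start of the series concrete, here is a small self-contained run of the function on a toy vector (the input values are made up for illustration). The first `b - 1` entries are `missing` because a full window is not yet available:

```julia
using Statistics

# Same definition as above, repeated so the snippet runs on its own.
function lagged_mean(x, b)
    map(1:length(x)) do i
        i < b ? missing : mean(@view x[i-b+1:i])
    end
end

x = [1, 2, 3, 4, 5]       # toy input
m = lagged_mean(x, 3)     # [missing, missing, 2.0, 3.0, 4.0]
```

Since the result contains `missing`, compare it with `isequal` rather than `==`, and drop the leading entries with `skipmissing` if the plotting package does not handle them.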
Upvotes: 2
Reputation: 1405
The TimeSeries package provides various utilities for working with time series data.
In this case, you can use collapse to convert from daily to weekly data:
julia> using TimeSeries, DataFrames
julia> ta = TimeArray(df.date, df.posts)
1311×1 TimeArray{Int64,1,Date,Array{Int64,1}} 2016-10-19 to 2020-06-16
│ │ A │
├────────────┼───────┤
│ 2016-10-19 │ 1 │
│ 2016-10-20 │ 2 │
│ 2016-10-21 │ 3 │
│ 2016-10-23 │ 4 │
...
julia> weekly = collapse(ta, week, last, sum)
192×1 TimeArray{Int64,1,Date,Array{Int64,1}} 2016-10-23 to 2020-06-16
│ │ A │
├────────────┼───────┤
│ 2016-10-23 │ 10 │
│ 2016-10-28 │ 22 │
│ 2016-11-06 │ 34 │
...
julia> using Gadfly
julia> plot(DataFrame(weekly)[1:end-1,:], x=:timestamp, y=:A, Geom.line(), Guide.ylabel("Weekly sum of Posts"), Guide.xlabel("Week"))
Upvotes: 2