Keno Fischer

Reputation: 1405

Plot weekly data from DataFrame of daily data

I have a Julia DataFrame like so:

│ Row  │ date               │ users │ posts │ topics │ likes │ pageviews │
│      │ Date               │ Int64 │ Int64 │ Int64  │ Int64 │ Int64     │
├──────┼────────────────────┼───────┼───────┼────────┼───────┼───────────┤
│ 1    │ Date("2020-06-16") │  1    │  3    │ 4      │ 7     │ 10000     │
│ 2    │ Date("2020-06-15") │  2    │  2    │ 5      │ 8     │ 20000     │
│ 3    │ Date("2020-06-14") │  3    │  3    │ 6      │ 9     │ 30000     │

I would like a plot of posts vs. date, but the daily data is too noisy, so I'd like to sum the posts for each week and plot that instead. What's the easiest way to achieve that?

Upvotes: 6

Views: 602

Answers (3)

Przemyslaw Szufel

Reputation: 42194

Here is my proposal, which lets you control which day starts the week (I use Monday here; Date(2000, 1, 3) is a Monday) and is also robust to missing data (dates can also be repeated in the dataset):

using DataFrames, Dates
df.weekno = Dates.days.(df.date .- Date(2000, 1, 3)) .÷ 7
combine(groupby(df, :weekno), :val => sum)  # use :posts for the question's data

Note that in this benchmark the approach is roughly 3.4x faster than flooring the dates with Week:

julia> @btime transform($df,:date => (d->floor.(d,Week)) => :week);
  63.699 μs (68 allocations: 175.47 KiB)
julia> @btime transform($df,:date => (d->Dates.days.(d .- Date(2000, 1, 3)) .÷ 7) => :week);
  18.900 μs (74 allocations: 175.64 KiB)
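
The integer week number is convenient for grouping but awkward as a plot axis. As a minimal sketch using only the Dates standard library, the helpers below (the names `weekno` and `weekstart` are hypothetical, not part of any package) apply the same arithmetic to single dates and map a week number back to the Monday that starts it. The sketch uses `fld` rather than `÷` so that dates before the reference Monday also land in the right bucket; that is a deviation from the snippet above, which truncates toward zero.

```julia
using Dates

# Reference Monday; pick a Sunday such as Date(2000, 1, 2) for Sunday-based weeks.
const REF = Date(2000, 1, 3)

# Number of whole weeks between REF and d (floored, so it also works before REF).
weekno(d::Date; ref::Date = REF) = fld(Dates.value(d - ref), 7)

# First day of the week with the given number, usable as an x-axis value.
weekstart(n::Integer; ref::Date = REF) = ref + Day(7n)

# The three rows from the question, in ascending date order.
dates = [Date(2020, 6, 14), Date(2020, 6, 15), Date(2020, 6, 16)]
posts = [3, 2, 3]

# Sum posts per week start without any extra packages.
weekly = Dict{Date,Int}()
for (d, p) in zip(dates, posts)
    k = weekstart(weekno(d))
    weekly[k] = get(weekly, k, 0) + p
end
```

With a DataFrame, the same `weekstart` value could be stored in a column and used as the grouping key in place of the bare week number.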

Upvotes: 2

genauguy

Reputation: 305

In DataFrames you can use groupby with combine as follows:

julia> using DataFrames, Statistics, Dates, Pipe;
julia> df = DataFrame(date = range(Date(2000, 01, 01), Date(2020, 01, 01), step = Day(1)));
julia> df.val = rand(nrow(df));
julia> @pipe df |>
           transform(_,
               :date => ByRow(year) => :year,
               :date => ByRow(week) => :week # ISO week number, 1:53
           ) |>
           groupby(_, [:week, :year]) |>
           combine(_, :val => mean)
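
One caveat with grouping on `year` and `week` separately: `week` returns the ISO week number, and the first days of January can belong to the last ISO week of the previous year. As a small sketch using only the Dates standard library, the following shows the mismatch and one possible alternative key, `firstdayofweek`, which returns the Monday of the week containing a date. The helper name `key` is made up for illustration.

```julia
using Dates

d1 = Date(2000, 1, 1)    # a Saturday
d2 = Date(2000, 12, 30)  # also a Saturday, almost a year later

# Both dates share the key (year = 2000, week = 52), so they would be pooled.
key(d) = (year(d), week(d))
same_bucket = key(d1) == key(d2)

# Keying on the Monday that starts each week keeps them apart.
monday1 = firstdayofweek(d1)
monday2 = firstdayofweek(d2)
```

Whether this matters depends on the data; for a plot of weekly totals, a date-valued key also gives a ready-made x-axis.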

If you want a moving average, you can use the following function:

julia> function lagged_mean(x, b)
           map(1:length(x)) do i
               i < b ? missing : mean(@view x[i-b+1:i])
           end
       end

julia> lagged_mean(df.val, 7)
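
To make the windowing concrete, here is a small self-contained check of the same function on a five-element vector with a window of 3. The first two positions have fewer than three observations behind them, so they come back as `missing`.

```julia
using Statistics

# Same definition as above: trailing mean over the last b observations.
function lagged_mean(x, b)
    map(1:length(x)) do i
        i < b ? missing : mean(@view x[i-b+1:i])
    end
end

smoothed = lagged_mean([1, 2, 3, 4, 5], 3)

# isequal is used for the comparison because == propagates missing.
matches = isequal(smoothed, [missing, missing, 2.0, 3.0, 4.0])
```

When plotting, the leading `missing` entries may need to be skipped or dropped, depending on the plotting package.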

Upvotes: 2

Keno Fischer

Reputation: 1405

The TimeSeries package provides various utilities for working with time series data. In this case, you can use the collapse function to convert from daily to weekly data:

julia> using TimeSeries, DataFrames

julia> ta = TimeArray(df.date, df.posts)
1311×1 TimeArray{Int64,1,Date,Array{Int64,1}} 2016-10-19 to 2020-06-16
│            │ A     │
├────────────┼───────┤
│ 2016-10-19 │ 1     │
│ 2016-10-20 │ 2     │
│ 2016-10-21 │ 3     │
│ 2016-10-23 │ 4     │
...

julia> weekly = collapse(ta, week, last, sum)
192×1 TimeArray{Int64,1,Date,Array{Int64,1}} 2016-10-23 to 2020-06-16
│            │ A     │
├────────────┼───────┤
│ 2016-10-23 │ 10    │
│ 2016-10-28 │ 22    │
│ 2016-11-06 │ 34    │
...

julia> using Gadfly
julia> plot(DataFrame(weekly)[1:end-1,:], x=:timestamp, y=:A, Geom.line(), Guide.ylabel("Weekly sum of Posts"), Guide.xlabel("Week"))

Upvotes: 2
