Conditional Summing with multiple columns in R

Question

I want to plot points per user versus time but I am not sure what to do to the columns in order to achieve that result. This is what my data looks like:

> head(data, n=3)
points   user       time
25        1      02/22/2017
0         2      02/26/2017
15        3      02/27/2017

> dput(data)
structure(list(points = c(25, 0, 15), user = c(1, 2, 3), time = c("02/22/2017", "02/26/2017", "02/27/2017")), .Names = c("points", "user", "time"), row.names = c(NA, -3L), class = "data.frame")

FYI, there are multiple user ids (I think up to 15). What I want to do is sum the total points per user (the number in the user column corresponds to the user's id number) and then plot those values over time (by day, specifically).

This is the code I use to generate the total points per user

library(data.table)
ppu = setkey(setDT(df), user_id)[, list(points=sum(points)), by=list(user_id)]

Which gives the following result:

But I want to find the total points per user per day! I would really appreciate any guidance.

Uwe · Accepted Answer

Please try the following (with df being the result of the dput() in the question):

library(data.table)   # version 1.10.4 used
ppu <- setDT(df)[, .(points = sum(points)), by = .(user, time)]

ppu
#   user       time points
#1:    1 02/22/2017     25
#2:    2 02/26/2017      0
#3:    3 02/27/2017     15
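The three-row sample has only one row per user and day, so nothing is actually summed there. A minimal sketch with made-up data may make the effect visible; the extra rows and the names df2 and ppu2 are assumptions for illustration, not the OP's data:

```r
library(data.table)

# Hypothetical data: user 1 has two rows on 02/22/2017 and one on 02/27/2017
df2 <- data.frame(
  points = c(25, 5, 0, 15, 10),
  user   = c(1, 1, 2, 3, 1),
  time   = c("02/22/2017", "02/22/2017", "02/26/2017",
             "02/27/2017", "02/27/2017"),
  stringsAsFactors = FALSE
)

# Sum points within each (user, time) combination
ppu2 <- setDT(df2)[, .(points = sum(points)), by = .(user, time)]
ppu2
# user 1 on 02/22/2017 now has points = 30 (25 + 5);
# the five input rows collapse into four groups
```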

This returns the user, time groups in the order in which they first appear in df. If you want the result sorted, you have two choices:

E.g., for printing, use

ppu[order(user, time)]
# or
ppu[order(time, user)]

or, if the result should be keyed, try keyby:

ppu <- setDT(df)[, .(points = sum(points)), keyby = .(user, time)]
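To check what keyby did, key() can be called on the result. The sketch below rebuilds df from the question's dput() so it runs on its own. One caveat, which is my own note and not part of the question: as long as time is a character column, sorting is lexical, so dates in month/day/year notation only sort chronologically within one year.

```r
library(data.table)

# Sample data as given by dput() in the question
df <- data.frame(
  points = c(25, 0, 15),
  user   = c(1, 2, 3),
  time   = c("02/22/2017", "02/26/2017", "02/27/2017"),
  stringsAsFactors = FALSE
)

ppu <- setDT(df)[, .(points = sum(points)), keyby = .(user, time)]

# The grouping columns have become the key of the result
key(ppu)

# Character dates sort lexically: the 2016 date ends up last here
sort(c("12/31/2016", "01/01/2017"))
```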

Some remarks:

Your code snippet uses user_id while your data sample uses user. Also, the data sample includes a column named time which contains dates as character strings, but in the text you use the term "day".
by accepts more than one grouping variable. You can even create expressions on the fly.
As a simplification, time doesn't need to be coerced to class Date as long as the same dates are always written the same way.
In data.table syntax, .() is an abbreviation for list().
Recent versions of data.table have lifted the requirement to set keys, so the setkey() call is not needed for grouping.
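As an illustration of the remark about expressions in by, here is a minimal sketch that converts time to class Date while grouping. The column name day and the result name ppu_day are made up for this example, and the format string assumes the month/day/year notation of the sample data:

```r
library(data.table)

# Sample data as given by dput() in the question
df <- data.frame(
  points = c(25, 0, 15),
  user   = c(1, 2, 3),
  time   = c("02/22/2017", "02/26/2017", "02/27/2017"),
  stringsAsFactors = FALSE
)

# Group by user and by a Date column computed on the fly inside by
ppu_day <- setDT(df)[, .(points = sum(points)),
                     by = .(user, day = as.Date(time, format = "%m/%d/%Y"))]

# day is of class Date, so it sorts chronologically
str(ppu_day)
```

Grouping on a real Date also avoids treating "2/22/2017" and "02/22/2017" as different days, which is the situation the third remark warns about.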

In a comment, the OP asked how

to plot the amount of points per user vs time (per day).

This requires some modifications to ppu so that it works better with ggplot2.

# coerce user to factor to get a discrete colour scale
# only required here because user was given as numeric 
ppu[, user := factor(user)]
# coerce time from character to Date class
# to get a nicely scaled x-axis instead of discrete values
ppu[, time := lubridate::mdy(time)]

Now, points are plotted versus time but with a separate, colour-coded line for each user:

library(ggplot2)
ggplot(ppu, aes(time, points, group = user, colour = user)) + 
  geom_point() + geom_line()

Well, you probably would see lines here if there were enough sample data ...
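To actually see lines, each user needs points on more than one day. The following is a sketch with simulated data; the table demo, its values, and the name ppu_demo are made up for illustration. It uses base as.Date() instead of lubridate::mdy() only to avoid an extra dependency:

```r
library(data.table)
library(ggplot2)

# Hypothetical data: 3 users, the same 5 consecutive days each
set.seed(1)
demo <- data.table(
  user   = rep(1:3, each = 5),
  time   = rep(format(as.Date("2017-02-22") + 0:4, "%m/%d/%Y"), times = 3),
  points = sample(0:25, 15, replace = TRUE)
)

# Same steps as above: aggregate, then prepare the columns for plotting
ppu_demo <- demo[, .(points = sum(points)), by = .(user, time)]
ppu_demo[, user := factor(user)]
ppu_demo[, time := as.Date(time, format = "%m/%d/%Y")]

# One colour-coded line per user
p <- ggplot(ppu_demo, aes(time, points, group = user, colour = user)) +
  geom_point() + geom_line()
print(p)
```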
