Reputation: 75
I want to plot points per user versus time but I am not sure what to do to the columns in order to achieve that result. This is what my data looks like:
> head(data, n=3)
  points user       time
      25    1 02/22/2017
       0    2 02/26/2017
      15    3 02/27/2017
> dput(data)
structure(list(points = c(25, 0, 15), user = c(1, 2, 3), time = c("02/22/2017", "02/26/2017", "02/27/2017")), .Names = c("points", "user", "time"), row.names = c(NA, -3L), class = "data.frame")
FYI, there are multiple user ids (I think up to 15). What I want to do is sum the total points per user (the number in the user column corresponds to the user's id number) and then plot those values over time (by day, specifically).
This is the code I use to generate the total points per user
library(data.table)
ppu = setkey(setDT(df), user_id)[, list(points=sum(points)), by=list(user_id)]
Which gives the following result:
But I want to find the total points per user per day! I would really appreciate any guidance.
Upvotes: 0
Views: 295
Reputation: 42544
Please try this (with df as given by the result of dput() in the question):
library(data.table) # version 1.10.4 used
ppu <- setDT(df)[, .(points = sum(points)), by = .(user, time)]
ppu
#    user       time points
# 1:    1 02/22/2017     25
# 2:    2 02/26/2017      0
# 3:    3 02/27/2017     15
This will return user, time in the order they appear in df. If you want to have the result sorted, you have two choices:
E.g., for printing, use
ppu[order(user, time)]
# or
ppu[order(time, user)]
or, if the result should be keyed, try keyby:
ppu <- setDT(df)[, .(points = sum(points)), keyby = .(user, time)]
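The sample in the question has only one row per user and day, so the sum is not visible there. Here is a minimal sketch with made-up data (the values are an assumption, not taken from the question) in which a user scores more than once per day. The second call groups by an expression created on the fly; as.Date() with an explicit format is just one possible way to derive a day column.

```r
library(data.table)

# Hypothetical sample data: users 1 and 2 each score twice on one day.
df <- data.frame(
  points = c(25, 5, 0, 10, 15),
  user   = c(1, 1, 2, 2, 1),
  time   = c("02/22/2017", "02/22/2017", "02/26/2017", "02/26/2017", "02/27/2017"),
  stringsAsFactors = FALSE
)

# Sum points within each user/time combination; keyby also sorts the result.
ppu <- setDT(df)[, .(points = sum(points)), keyby = .(user, time)]
ppu

# Same aggregation, but the second grouping variable is an expression
# evaluated on the fly (character dates converted to class Date).
ppd <- df[, .(points = sum(points)),
          by = .(user, day = as.Date(time, format = "%m/%d/%Y"))]
ppd
```

With keyby the rows come back sorted by user and then time, whereas by keeps the groups in order of first appearance.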
Some remarks:
- The code in your question uses user_id while your data sample uses user. Also, the data sample includes a column named time which contains dates as character strings, but in the text you use the term "day".
- by accepts more than one grouping variable. You can even create expressions on the fly.
- time doesn't need to be coerced to class Date as long as the same dates are always written the same way.
- In data.table syntax, .() is an abbreviation for list().
- Newer versions of data.table have lifted the requirement to set keys.
In a comment, the OP asked how to plot the amount of points per user vs time (per day). This requires some modifications to ppu to work better with ggplot2.
# coerce user to factor to get a discrete colour scale
# only required here because user was given as numeric
ppu[, user := factor(user)]
# coerce time from character to Date class
# to get a nicely scaled x-axis instead of discrete values
ppu[, time := lubridate::mdy(time)]
Now, points are plotted versus time but with a separate, colour-coded line for each user:
library(ggplot2)
ggplot(ppu, aes(time, points, group = user, colour = user)) +
geom_point() + geom_line()
Well, you probably would see lines here if there were enough sample data ...
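To actually see lines, more data points per user are needed. The following is only a sketch with randomly generated data (3 users, 5 days, 2 entries per user and day, all of which are assumptions for illustration). It uses base as.Date() instead of lubridate::mdy() to keep the dependencies small; both should give the same Date values for this format.

```r
library(data.table)
library(ggplot2)

# Made-up data: 3 users, 5 consecutive days, 2 entries per user per day.
set.seed(1)
days <- format(seq(as.Date("2017-02-22"), by = "day", length.out = 5), "%m/%d/%Y")
demo <- CJ(user = 1:3, time = days, entry = 1:2)
demo[, points := sample(0:25, .N, replace = TRUE)]

# Total points per user per day, then prepare the columns for plotting.
agg <- demo[, .(points = sum(points)), by = .(user, time)]
agg[, user := factor(user)]                         # discrete colour scale
agg[, time := as.Date(time, format = "%m/%d/%Y")]   # continuous date axis

p <- ggplot(agg, aes(time, points, group = user, colour = user)) +
  geom_point() + geom_line()
print(p)
```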
Upvotes: 3
Reputation: 1
First you need to convert your dates to a proper date class. For that I'd suggest the lubridate package; the dates are stored in the time column, so create a day column like this:
library(lubridate)
data$day <- mdy(data$time)
Then sum the number of points for each user for each day:
library(plyr)
pts_user_day <- ddply(data, .(user, day), summarise, pts_day = sum(points))
Finally plot all of this over time:
library(ggplot2)
ggplot(pts_user_day, aes(x=day, y=pts_day, col=factor(user))) + geom_line() + scale_x_date()
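If you would rather avoid extra packages for the first two steps, the same conversion and summation can be sketched in base R. The data below is made up in the shape of the question's sample (user 1 scoring twice on one day is an assumption), and as.Date() and aggregate() are swapped in for mdy() and ddply().

```r
# Hypothetical data in the same shape as the question's sample.
data <- data.frame(
  points = c(25, 5, 0, 15),
  user   = c(1, 1, 2, 3),
  time   = c("02/22/2017", "02/22/2017", "02/26/2017", "02/27/2017"),
  stringsAsFactors = FALSE
)

# Convert the character dates to class Date.
data$day <- as.Date(data$time, format = "%m/%d/%Y")

# Sum points for every user/day combination.
pts_user_day <- aggregate(points ~ user + day, data = data, FUN = sum)

# Rename the summed column to match the name used in the plot call.
names(pts_user_day)[names(pts_user_day) == "points"] <- "pts_day"
pts_user_day
```

The resulting pts_user_day has the columns user, day and pts_day, so it can be passed to the same ggplot() call.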
Hope that helps!
Upvotes: 0