Reputation: 515
I have the following dataset:
https://app.box.com/s/au58xaw60r1hyeek5cua6q20byumgvmj
I want to create a density plot based on the time of the day. Here is what I've done so far:
library("ggplot2")
library("scales")
library("lubridate")
timestamp_df$timestamp_time <- format(ymd_hms(hn_tweets$timestamp), "%H:%M:%S")
ggplot(timestamp_df, aes(timestamp_time)) +
geom_density(aes(fill = ..count..)) +
scale_x_datetime(breaks = date_breaks("2 hours"),labels=date_format("%H:%M"))
It gives the following error:
Error: Invalid input: time_trans works with objects of class POSIXct only
If I convert that to POSIXct, it adds dates to the data.
Update 1
The following converted the data to NA:
timestamp_df$timestamp_time <- as.POSIXct(timestamp_df$timestamp_time, format = "%H:%M%:%S", tz = "UTC")
Update 2
Following is what I want to achieve:
Upvotes: 1
Views: 4014
Reputation: 258
One problem with the solutions posted here is that they ignore the fact that this data is circular/polar (i.e. 00h == 24h). You can see in the plots in the other answer that the two ends of the chart don't match up with each other. This won't make much of a difference with this particular dataset, but for events that happen near midnight, a non-circular estimate of the density can be badly biased. Here's my solution, taking into account the circular nature of time data:
# modified code from https://freakonometrics.hypotheses.org/2239
library(dplyr)
library(ggplot2)
library(lubridate)
library(circular)
df = read.csv("data.csv")
datetimes = df$timestamp %>%
  lubridate::parse_date_time("%m/%d/%Y %H:%M")
times_in_decimal = lubridate::hour(datetimes) + lubridate::minute(datetimes) / 60
times_in_radians = 2 * pi * (times_in_decimal / 24)
# Doing this just for bandwidth estimation:
basic_dens = density(times_in_radians, from = 0, to = 2 * pi)
res = circular::density.circular(
  circular::circular(times_in_radians,
                     type = "angle",
                     units = "radians",
                     rotation = "clock"),
  kernel = "wrappednormal",
  bw = basic_dens$bw)
time_pdf = data.frame(
  time = as.numeric(24 * (2 * pi + res$x) / (2 * pi)), # Convert from radians back to 24h clock
  likelihood = res$y)
p = ggplot(time_pdf) +
  geom_area(aes(x = time, y = likelihood), fill = "#619CFF") +
  scale_x_continuous("Hour of Day", labels = 0:24, breaks = 0:24) +
  scale_y_continuous("Likelihood of Data") +
  theme_classic()
Note that the values and slopes of the density plot match up at the 00h and 24h points.
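To see why wrapping matters, here is a minimal base R sketch that does not use the circular package. It is not the code from this answer: the data are simulated, and the helper names linear_kde and wrapped_kde are made up for illustration. It approximates a wrapped Gaussian kernel by letting every observation also contribute from one period earlier and one period later, then compares how well the two ends of the day line up.

```r
# Hypothetical event times (hours on a 24h clock), clustered just before midnight
set.seed(42)
hours <- rnorm(500, mean = 23, sd = 1.5) %% 24

# Same rule-of-thumb bandwidth that density() uses by default
bw <- bw.nrd0(hours)

# Plain Gaussian KDE: treats 0h and 24h as opposite ends of a line
linear_kde <- function(x, data, bw) {
  sapply(x, function(xi) mean(dnorm(xi, mean = data, sd = bw)))
}

# Wrapped Gaussian KDE (approximation): each observation also contributes
# from one period earlier and one period later, so mass near 24h spills
# over to 0h and vice versa
wrapped_kde <- function(x, data, bw, period = 24) {
  shifted <- c(data - period, data, data + period)
  sapply(x, function(xi) sum(dnorm(xi, mean = shifted, sd = bw)) / length(data))
}

grid <- seq(0, 24, by = 0.25)
linear <- linear_kde(grid, hours, bw)
wrapped <- wrapped_kde(grid, hours, bw)

# Mismatch between the two ends of the day for each estimator
linear_gap <- abs(linear[1] - linear[length(grid)])
wrapped_gap <- abs(wrapped[1] - wrapped[length(grid)])
print(c(linear = linear_gap, wrapped = wrapped_gap))
```

The wrapped estimate should agree at 0h and 24h up to floating point noise, while the plain estimate leaves a visible gap. Using only three copies of the data is an approximation that is reasonable as long as the bandwidth is small relative to 24 hours.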
Upvotes: 5
Reputation: 19756
Here is one approach:
library(ggplot2)
library(lubridate)
library(scales)
df <- read.csv("data.csv") #given in OP
Convert character to POSIXct:
df$timestamp <- as.POSIXct(strptime(df$timestamp, "%m/%d/%Y %H:%M", tz = "UTC"))
library(hms)
Extract hour, minute, and second:
df$time <- hms::hms(second(df$timestamp), minute(df$timestamp), hour(df$timestamp))
Convert to POSIXct again, since ggplot does not work with class hms.
df$time <- as.POSIXct(df$time)
ggplot(df, aes(time)) +
geom_density(fill = "red", alpha = 0.5) + #also play with adjust such as adjust = 0.5
scale_x_datetime(breaks = date_breaks("2 hours"), labels=date_format("%H:%M"))
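On the concern in the question that converting to POSIXct "adds dates to the data": that is harmless as long as every observation lands on the same date, because the axis labels only print %H:%M. Here is a rough base R sketch of the same idea without the hms package. The three timestamps are made up, and 1970-01-01 is just a convenient dummy date.

```r
# Hypothetical timestamps in the same "%m/%d/%Y %H:%M" layout as the data
stamps <- c("01/15/2017 08:30", "03/02/2017 23:45", "07/21/2017 08:30")
parsed <- as.POSIXct(strptime(stamps, "%m/%d/%Y %H:%M", tz = "UTC"))

# Seconds since midnight on each timestamp's own day
secs <- as.numeric(format(parsed, "%H")) * 3600 +
  as.numeric(format(parsed, "%M")) * 60 +
  as.numeric(format(parsed, "%S"))

# Re-anchor every observation on one dummy date (1970-01-01, UTC)
time_of_day <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

print(format(time_of_day, "%H:%M"))
print(unique(format(time_of_day, "%Y-%m-%d")))
```

The result is still POSIXct, so scale_x_datetime accepts it, but only the time of day varies between rows.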
To plot it scaled to a maximum of 1:
ggplot(df) +
geom_density( aes(x = time, y = ..scaled..), fill = "red", alpha = 0.5) +
scale_x_datetime(breaks = date_breaks("2 hours"), labels=date_format("%H:%M"))
where ..scaled.. is a computed variable for stat_density made during plot creation.
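For intuition, the scaled variable is the density estimate divided by its own maximum, so the peak sits at 1. A small base R sketch of that calculation on a made-up sample (this mimics the idea and is not ggplot2's internal code):

```r
set.seed(1)
x <- rnorm(200)  # hypothetical sample standing in for the time values

# Ordinary kernel density estimate; the area under the curve is about 1
d <- density(x)

# Rescale so the highest point of the curve is exactly 1
scaled <- d$y / max(d$y)

print(range(scaled))
```

In recent ggplot2 releases the same variable can also be written as after_stat(scaled), which is the spelling that replaces the ..scaled.. notation.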
Upvotes: 1