Reputation: 384
I have a data set with 12,000 rows of the following (IP addresses changed, obviously):
      Date                good? ip
11670 2016-10-31 00:03:30 TRUE  127.0.0.1
11671 2016-10-31 00:03:23 TRUE  127.0.0.1
11672 2016-10-31 00:03:20 TRUE  127.0.0.1
11673 2016-10-31 00:03:11 TRUE  127.0.0.1
11674 2016-10-31 00:03:08 TRUE  127.0.0.1
11675 2016-10-31 00:02:59 FALSE 127.0.0.1
11676 2016-10-31 00:02:53 TRUE  127.0.0.1
11677 2016-10-31 00:02:49 FALSE 127.0.0.1
11678 2016-10-31 00:02:44 FALSE 127.0.0.1
11679 2016-10-31 00:02:40 TRUE  127.0.0.1
So, I've done some things to this, like "Get all ip addresses that have over 100 rows of combined TRUE and FALSE", etc.
I'm trying to transform the above into an easily plottable format where I can count the "good" bool values and group them into intervals of 1 hour. Something like:
date                TRUE FALSE
2016-10-31 00:00:00  342  1010
2016-10-31 01:00:00  544   890
I tried cut.POSIXt(data$date, breaks = "hour"), but that didn't preserve the bool state and also gave me an unplottable object.
I tried lapply(split(votes, cut(votes$date, "hour")), function(x) summary(x[2])), which I found online and just happened to fit my use case, but:
1) I don't understand a lot of what is happening beyond cut and split.
2) It gives me a list with a date in characters followed by a blob of characters, which I can't plot.
What transforms and steps should I be taking here? I tried aggregate, but I always lose TRUE/FALSE distinctions.
The goal is to make a plot that has two colored bars per hour interval, blue for total true and red for total false registered in that hour interval.
Upvotes: 1
Views: 116
Reputation: 1500
Here's a solution with data.table
library(data.table)
dt <- data.table(df)
Change the format string to match your date format; the one below matches the timestamps shown in the question. In your example there also appears to be a row id before the date, which you should remove first.
dt$date <- as.POSIXct(dt$date, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
Use cut to generate the hour category
dt$break_hour <- cut(dt$date, breaks="hour")
Use data.table (faster than base R functions such as aggregate) to create the count columns
dt <- dt[,list(good = sum(good), bad = sum(!good)), by=break_hour ]
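To check the steps above end to end, here is a minimal sketch that runs them on the ten sample rows from the question. It assumes the row ids have already been dropped and that the columns are named date and good; the name hourly is only illustrative.

```r
library(data.table)

# Sample rows copied from the question (row ids dropped; column names assumed)
dt <- data.table(
  date = as.POSIXct(c("2016-10-31 00:03:30", "2016-10-31 00:03:23",
                      "2016-10-31 00:03:20", "2016-10-31 00:03:11",
                      "2016-10-31 00:03:08", "2016-10-31 00:02:59",
                      "2016-10-31 00:02:53", "2016-10-31 00:02:49",
                      "2016-10-31 00:02:44", "2016-10-31 00:02:40"),
                    format = "%Y-%m-%d %H:%M:%S", tz = "GMT"),
  good = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Bucket each timestamp into its hour (cut returns a factor of hour labels)
dt[, break_hour := cut(date, breaks = "hour")]

# Count TRUE and FALSE rows per hour bucket
hourly <- dt[, .(good = sum(good), bad = sum(!good)), by = break_hour]
print(hourly)
```

All ten sample rows fall in the same hour, so the result has a single row. On the full data set you should get one row per hour that contains at least one observation.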
Upvotes: 1
Reputation: 24198
Here's one approach using dplyr based on your example data, assuming df$Date is of type POSIXct.
library(dplyr)
library(tidyr)
df %>%
  group_by(date = format(Date, "%Y-%m-%d"),
           hour = format(Date, "%H"),
           good = as.character(good)) %>%
  summarise(count = n()) %>%
  spread(key = good, value = count)
# date hour `FALSE` `TRUE`
#* <chr> <chr> <int> <int>
#1 2016-10-31 00 3 7
If your intention is to plot, you probably don't need the last step of spread(), and can plug what comes before it into ggplot2, for example.
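As a rough sketch of that plotting step, the long-format counts can go straight into ggplot2 as side-by-side bars, blue for TRUE and red for FALSE as the question asks. The sample data frame, the combined hour label, and the names hourly and p are assumptions made for illustration, not part of the original data.

```r
library(dplyr)
library(ggplot2)

# Sample rows copied from the question (column names assumed)
df <- data.frame(
  Date = as.POSIXct(c("2016-10-31 00:03:30", "2016-10-31 00:03:23",
                      "2016-10-31 00:03:20", "2016-10-31 00:03:11",
                      "2016-10-31 00:03:08", "2016-10-31 00:02:59",
                      "2016-10-31 00:02:53", "2016-10-31 00:02:49",
                      "2016-10-31 00:02:44", "2016-10-31 00:02:40"),
                    tz = "GMT"),
  good = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Long format: one row per (hour, good) combination, without spread()
hourly <- df %>%
  group_by(hour = format(Date, "%Y-%m-%d %H:00:00"),
           good = as.character(good)) %>%
  summarise(count = n(), .groups = "drop")

# Two bars per hour, placed side by side and coloured by the good flag
p <- ggplot(hourly, aes(x = hour, y = count, fill = good)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red")) +
  labs(x = "Hour", y = "Rows", fill = "good")
print(p)
```

Here the date and hour are folded into a single label so each hour gets one position on the x axis; with many hours you may prefer to keep the hour as a POSIXct value instead of a string.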
Upvotes: 0