JapanRob

Reputation: 384

R language split data frame into relevant format for graph

I have a data set with 12,000 rows of the following (IP addresses changed, obviously):

      Date                 good?  ip
11670 2016-10-31 00:03:30   TRUE  127.0.0.1
11671 2016-10-31 00:03:23   TRUE  127.0.0.1
11672 2016-10-31 00:03:20   TRUE  127.0.0.1
11673 2016-10-31 00:03:11   TRUE  127.0.0.1
11674 2016-10-31 00:03:08   TRUE  127.0.0.1
11675 2016-10-31 00:02:59  FALSE  127.0.0.1
11676 2016-10-31 00:02:53   TRUE  127.0.0.1
11677 2016-10-31 00:02:49  FALSE  127.0.0.1
11678 2016-10-31 00:02:44  FALSE  127.0.0.1
11679 2016-10-31 00:02:40   TRUE  127.0.0.1

So, I've done some things to this, like "Get all ip addresses that have over 100 rows of combined TRUE and FALSE", etc.

I'm trying to take the above and transform it into an easily plottable format where I can count the "good" bool values (TRUE and FALSE) and group them into intervals of 1 hour. Something like:

date                TRUE  FALSE
2016-10-31 00:00:00 342  1010
2016-10-31 01:00:00 544   890

I tried cut.POSIXt(data$date, breaks = "hour"), but that didn't preserve the bool state and also gave me an unplottable object.

I tried lapply(split(votes, cut(votes$date, "hour")), function(x) summary (x[2])), which I found online and just happened to fit my use case, but

1) I don't understand a lot of what is happening beyond cut and split
2) It gives me a list with a date in characters followed by a blob of characters, which I can't plot

What transforms and steps should I be taking here? I tried aggregate, but I always lose TRUE/FALSE distinctions.

The goal is to make a plot that has two colored bars per hour interval, blue for total true and red for total false registered in that hour interval.

Upvotes: 1

Views: 116

Answers (2)

timat

Reputation: 1500

Here's a solution with data.table

library(data.table)
dt <- data.table(df)

Parse the date column, changing the format string to match your own date format (the one below matches the sample in your question). In your example it looks like there is an id before the date; you should remove it first.

dt$date <- as.POSIXct(dt$date, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")

Use cut to generate the hour category

dt$break_hour <- cut(dt$date, breaks="hour")

Use data.table (faster than base R functions like aggregate) to create your columns

dt <- dt[,list(good = sum(good), bad = sum(!good)), by=break_hour ]
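
As a minimal sketch, the steps above can be run end to end on the ten sample rows from the question. The small table built here, the `:=` assignment style, and the name `hourly` are illustrative assumptions, not part of the original data set.

```r
library(data.table)

# Sample rows copied from the question (row ids and the ip column left out)
dt <- data.table(
  date = c("2016-10-31 00:03:30", "2016-10-31 00:03:23", "2016-10-31 00:03:20",
           "2016-10-31 00:03:11", "2016-10-31 00:03:08", "2016-10-31 00:02:59",
           "2016-10-31 00:02:53", "2016-10-31 00:02:49", "2016-10-31 00:02:44",
           "2016-10-31 00:02:40"),
  good = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Parse the timestamps; the format string matches the sample above
dt[, date := as.POSIXct(date, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")]

# Label each row with the hour it falls into
dt[, break_hour := cut(date, breaks = "hour")]

# Count TRUE and FALSE rows per hour
hourly <- dt[, .(good = sum(good), bad = sum(!good)), by = break_hour]
print(hourly)
```

For these ten rows, `hourly` has a single row with good = 7 and bad = 3.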

Upvotes: 1

mtoto

Reputation: 24198

Here's one approach using dplyr based on your example data, assuming df$Date is of type POSIXct.

library(dplyr)
library(tidyr)
df %>% 
  group_by(date = format(Date, "%Y-%m-%d"),
           hour = format(Date, "%H"),
           good = as.character(good)) %>%
  summarise(count = n()) %>%
  spread(key = good, value = count)
#        date  hour `FALSE` `TRUE`
#*      <chr> <chr>   <int>  <int>
#1 2016-10-31    00       3      7

If your intention is to plot, you probably don't need the last step of spread(), and can pass the result of summarise() straight to ggplot2, for example.
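
For the plot itself, here is a rough sketch of how the long-format counts could go into ggplot2, assuming the ten sample rows from the question. The `trunc()` call for flooring to the hour, the `counts` and `p` names, and the dodged-bar layout are my assumptions about what is wanted, not something taken from your real data.

```r
library(dplyr)
library(ggplot2)

# Sample rows copied from the question (row ids and the ip column left out)
df <- data.frame(
  Date = as.POSIXct(
    c("2016-10-31 00:03:30", "2016-10-31 00:03:23", "2016-10-31 00:03:20",
      "2016-10-31 00:03:11", "2016-10-31 00:03:08", "2016-10-31 00:02:59",
      "2016-10-31 00:02:53", "2016-10-31 00:02:49", "2016-10-31 00:02:44",
      "2016-10-31 00:02:40"),
    tz = "GMT"
  ),
  good = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Floor each timestamp to the start of its hour, then count rows per hour/good
counts <- df %>%
  mutate(hour = as.POSIXct(trunc(Date, units = "hours"))) %>%
  count(hour, good)

# Two bars per hour: blue for TRUE, red for FALSE
p <- ggplot(counts, aes(x = hour, y = n, fill = good)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("TRUE" = "blue", "FALSE" = "red")) +
  labs(x = "Hour", y = "Count", fill = "good")
print(p)
```

Keeping `hour` as POSIXct rather than a character string lets ggplot2 treat the x axis as a time scale, which should help once the data spans many hours.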

Upvotes: 0
