Reputation: 601
I have count data, i.e. every date + time combination represents one data point. My current data frame looks like this:
DATE TIME
1 2014-02-15 15:02
2 2014-02-15 15:12
3 2014-04-15 02:02
4 2014-05-15 11:02
5 2014-06-15 15:42
6 2014-06-15 16:02
....
Now I want a new data frame that counts how many data points there are per hour for each date, something like this:
DATE HOUR COUNT
1 2014-02-15 15 2
2 2014-04-15 02 1
3 2014-05-15 11 1
4 2014-06-15 15 1
5 2014-06-15 16 1
....
I want to do this so that I can make a boxplot with x = hour of the day and y = number of data points (over a year). I tried nested for loops, but that didn't work.
EDIT: if possible, date/hour combinations with no data points should also appear in the data frame, with COUNT = 0.
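A reproducible version of the data above, assuming plain character columns (the real columns may be typed differently), would look like this:
# toy version of the data frame shown above (column types assumed)
mydata <- data.frame(
  DATE = c("2014-02-15", "2014-02-15", "2014-04-15",
           "2014-05-15", "2014-06-15", "2014-06-15"),
  TIME = c("15:02", "15:12", "02:02", "11:02", "15:42", "16:02"),
  stringsAsFactors = FALSE
)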
Upvotes: 0
Views: 167
Reputation:
Is that what you are looking for?
options(stringsAsFactors = F)
data = read.table(text =
" 1 2014-02-15 15:02
2 2014-02-15 15:12
3 2014-04-15 02:02
4 2014-05-15 11:02
5 2014-06-15 15:42
6 2014-06-15 16:02")
colnames(data) = c("index", "date", "time")
table(data$date)
# 2014-02-15 2014-04-15 2014-05-15 2014-06-15
# 2 1 1 2
table(data$date, data$time)  # cross-tabulates by exact minute, which is too fine-grained
fz = table(data$date, substr(data$time, 1,2))
print(fz)
# 02 11 15 16
# 2014-02-15 0 0 2 0
# 2014-04-15 1 0 0 0
# 2014-05-15 0 1 0 0
# 2014-06-15 0 0 1 1
If you want to reshape your data, you can do the following:
library(reshape)
otherFormat = melt(fz)
colnames(otherFormat) = c("date","hour", "frequency")
print(otherFormat)
# date hour frequency
# 1 2014-02-15 2 0
# 2 2014-04-15 2 1
# 3 2014-05-15 2 0
# 4 2014-06-15 2 0
# 5 2014-02-15 11 0
# 6 2014-04-15 11 0
# 7 2014-05-15 11 1
# 8 2014-06-15 11 0
# 9 2014-02-15 15 2
# 10 2014-04-15 15 0
# 11 2014-05-15 15 0
# 12 2014-06-15 15 1
# 13 2014-02-15 16 0
# 14 2014-04-15 16 0
# 15 2014-05-15 16 0
# 16 2014-06-15 16 1
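To get from there to the boxplot mentioned in the question (x = hour of the day, y = number of data points), one possible follow-up with base graphics, using the otherFormat object above, would be:
# one box per hour, showing the spread of counts across dates
boxplot(frequency ~ hour, data = otherFormat,
        xlab = "hour of the day", ylab = "number of data points")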
Upvotes: 1
Reputation: 23574
An additional option would be the following. First, you create a column for hour in mutate(). Then, you count how many data points exist by DATE and hour in count(). Once you ungroup the data, you join the two data frames to create your desired outcome. The expand.grid() part creates all combinations of DATE and hour (00 to 23). Since you have 02 for 2, I used c(paste0("0", 0:9), 10:23). Finally, you replace NA with 0 in the final mutate().
library(dplyr)
library(stringi)
library(data.table)
mutate(mydf, DATE, hour = stri_extract_first(TIME, regex = "\\d+")) %>%
count(DATE, hour) %>%
ungroup %>%
right_join(expand.grid(DATE = unique(.$DATE),
hour = c(paste0("0", 0:9), 10:23))) %>%
mutate(n = replace(n, is.na(n), 0))
# Part of the outcome
# DATE hour n
#1 2014-02-15 00 0
#2 2014-04-15 00 0
#3 2014-05-15 00 0
#4 2014-06-15 00 0
#5 2014-02-15 01 0
Using data.table, you could do the same operation. You create a column for hour and count the number of data points by DATE and hour. Then, you merge temp with a data table that has all combinations of DATE and hour (00 to 23), which you can create with CJ(). Once the merge is done, you replace NA with 0 in the count column (total).
setDT(mydf)[, hour := stri_extract_first(TIME, regex = "\\d+")][,
list(total = .N), by = list(DATE, hour)] -> temp
merge(temp,
CJ(DATE = unique(mydf$DATE), hour = c(paste0("0", 0:9), 10:23)),
by = c("DATE", "hour"), all = TRUE)[, total := replace(total, is.na(total), 0)][]
# DATE hour total
# 1: 2014-02-15 02 0
# 2: 2014-02-15 11 0
# 3: 2014-02-15 15 2
# 4: 2014-02-15 16 0
# 5: 2014-02-15 00 0
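If you also want dates with no observations at all to show up (not only missing hours on observed dates), one option, assuming every day between the first and last observed date should be present, is to build the cross join from a full date sequence instead:
# all day/hour combinations from the first to the last observed date
CJ(DATE = seq(min(mydf$DATE), max(mydf$DATE), by = "day"),
   hour = c(paste0("0", 0:9), 10:23))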
DATA
mydf <- structure(list(DATE = structure(c(16116, 16116, 16175, 16205,
16236, 16236), class = "Date"), TIME = structure(c(3L, 4L, 1L,
2L, 5L, 6L), .Label = c("02:02", "11:02", "15:02", "15:12", "15:42",
"16:02"), class = "factor")), class = "data.frame", .Names = c("DATE",
"TIME"), row.names = c(NA, -6L))
Upvotes: 1
Reputation: 9133
IMO, the most readable way (edited to answer your updated question):
library(dplyr)
library(stringr)
df <- date.data %>%
group_by(
DATE = as.Date(DATE),
HOUR = as.numeric(str_sub(TIME, 1, 2))
) %>%
tally
# create a data frame with all dates/hours
expand.grid(
# include all dates from first to last
DATE = seq.Date(min(df$DATE), max(df$DATE), "day"),
HOUR = 0:23
) %>%
arrange(DATE) %>%
left_join(df, by = c("DATE", "HOUR"))
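Note that the left_join() leaves NA in n for date/hour combinations without data points. If you want an explicit 0, as in the edited question, a small follow-up (assuming the result of the pipeline above is stored in an object called counts; the name is only for illustration) would be:
# turn NA counts into 0 so empty date/hour combinations show up as zero
counts <- mutate(counts, n = replace(n, is.na(n), 0))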
Upvotes: 1
Reputation: 6749
You can do this in a few ways, but I suspect the easiest is to use table. With table you can return the frequency of each date, which is basically just a count of how often each date appears in the data frame. You can do the same thing after extracting the hour, and you can even cross-tabulate the two with table(DF$DATE, DF$HOUR). Wrapping the result in as.data.frame gets you a listing close to what you are looking for.
EDITED TO ADD: In response to your edit of the question, you can use factor levels to get the zero counts in the table output. table respects your factor levels by including them in its output even when a level does not appear in the input (in fact, I believe table coerces its input to factors behind the scenes).
Example code:
# Set options and load example data
options(stringsAsFactors = FALSE)
date.data <- data.frame(DATE = c("2014-02-15","2014-02-15","2014-04-15","2014-05-15","2014-06-15","2014-06-15"),
TIME = c("15:02","15:12","02:02","11:02","15:42","16:02"))
# Extract the hour
date.data$HOUR <- sapply(X = strsplit(x = date.data$TIME,split = ":"),FUN = `[[`,1)
# Now, set the hours as a factor level - this will allow table() to fill the data in as you are requesting
date.data$HOUR <- factor(x = date.data$HOUR, levels = sprintf("%02d", 0:23))
# Obtain the first table of interest
as.data.frame(table(date.data$DATE))
Var1 Freq
1 2014-02-15 2
2 2014-04-15 1
3 2014-05-15 1
4 2014-06-15 2
# And the second table
as.data.frame(table(date.data$DATE,date.data$HOUR))
Var1 Var2 Freq
1 2014-02-15 00 0
2 2014-04-15 00 0
3 2014-05-15 00 0
4 2014-06-15 00 0
5 2014-02-15 01 0
6 2014-04-15 01 0
7 2014-05-15 01 0
8 2014-06-15 01 0
....
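If you want column names that match the desired output (DATE, HOUR, COUNT) rather than Var1/Var2/Freq, one possible variation is to name the arguments to table() and use the responseName argument of as.data.frame():
# named dimensions become column names; responseName renames the count column
as.data.frame(table(DATE = date.data$DATE, HOUR = date.data$HOUR),
              responseName = "COUNT")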
Upvotes: 0