Shark167
Shark167

Reputation: 601

Make count data frame

I have data that I work with, which is count data i.e. every date+time combination represents a data point. So my current data frame looks as follows:

  DATE        TIME
1 2014-02-15  15:02
2 2014-02-15  15:12
3 2014-04-15  02:02
4 2014-05-15  11:02
5 2014-06-15  15:42
6 2014-06-15  16:02
....

Now I want to have a new DF that counts how many data points there are per hour for a particular date. Something like below:

  DATE        HOUR    COUNT
1 2014-02-15  15      2
2 2014-04-15  02      1
3 2014-05-15  11      1
4 2014-06-15  15      1
5 2014-06-15  16      1
....

I want to do this so that I can make a boxplot with x = hour of the day, y = number of data points (over a year). Tried to do it with nested for loops, but it didn't work.


EDIT: and if possible, the date/hour combination where there is no data point should be in the data frame, but with COUNT = 0.

Upvotes: 0

Views: 167

Answers (4)

user5363218
user5363218

Reputation:

Is that what you are looking for?

options(stringsAsFactors = F)

data = read.table(text  = 
"                  1 2014-02-15  15:02
                   2 2014-02-15  15:12
                   3 2014-04-15  02:02
                   4 2014-05-15  11:02
                   5 2014-06-15  15:42
                   6 2014-06-15  16:02")


colnames(data) = c("index", "date", "time")

table(data$date)

 # 2014-02-15 2014-04-15 2014-05-15 2014-06-15 
 #     2          1          1          2 

table(data$date, data$time)

fz = table(data$date, substr(data$time, 1,2))
print(fz)   

 #            02 11 15 16
 # 2014-02-15  0  0  2  0
 # 2014-04-15  1  0  0  0
 # 2014-05-15  0  1  0  0
 # 2014-06-15  0  0  1  1

If you want to reshape your data you can do the following:

library(reshape)

otherFormat = melt(fz)
colnames(otherFormat) = c("date","hour", "frequency")

print(otherFormat)

#          date hour frequency
# 1  2014-02-15    2         0
# 2  2014-04-15    2         1
# 3  2014-05-15    2         0
# 4  2014-06-15    2         0
# 5  2014-02-15   11         0
# 6  2014-04-15   11         0
# 7  2014-05-15   11         1
# 8  2014-06-15   11         0
# 9  2014-02-15   15         2
# 10 2014-04-15   15         0
# 11 2014-05-15   15         0
# 12 2014-06-15   15         1
# 13 2014-02-15   16         0
# 14 2014-04-15   16         0
# 15 2014-05-15   16         0
# 16 2014-06-15   16         1

Upvotes: 1

jazzurro
jazzurro

Reputation: 23574

Additional option would be the following. First, you create a column for hour in mutate(). Then, you count how many data points exist by DATE and hour in count(). Once you ungroup the data, you join two data frames to create your desired outcome. The expand.grid() part creates all combination of DATE and hour (00 to 23). Since you have 02 for 2, I used c(paste0("0", 0:9), 10:23)). Finally, you replace NA with 0 in the final mutate().

library(dplyr)
library(stringi)
library(data.table)

mutate(mydf, DATE, hour = stri_extract_first(TIME, regex = "\\d+")) %>%
count(DATE, hour) %>%
ungroup %>%
right_join(expand.grid(DATE = unique(.$DATE),
                       hour = c(paste0("0", 0:9), 10:23))) %>%
mutate(n = replace(n, is.na(n), 0))

# A bit of outcome
#         DATE hour n
#1  2014-02-15   00 0
#2  2014-04-15   00 0
#3  2014-05-15   00 0
#4  2014-06-15   00 0
#5  2014-02-15   01 0

Using data.table, you could do the same operation. You create a column for hour and count the number of data points by DATE and hour. Then, you want to merge temp with a data table which has all combination of DATE and hours (00 to 23). You can create the data table using CJ(). Once you finish the merge process, You replace NA with 0 in the column for count (total).

setDT(mydf)[, hour := stri_extract_first(TIME, regex = "\\d+")][,
            list(total = .N), by = list(DATE, hour)] -> temp

merge(temp,
      CJ(DATE = unique(mydf$DATE), hour = c(paste0("0", 0:9), 10:23)),
      by = c("DATE", "hour"), all = TRUE)[, total := replace(total, is.na(total), 0)][]

#          DATE hour total
# 1: 2014-02-15   02     0
# 2: 2014-02-15   11     0
# 3: 2014-02-15   15     2
# 4: 2014-02-15   16     0
# 5: 2014-02-15   00     0

DATA

mydf <- structure(list(DATE = structure(c(16116, 16116, 16175, 16205, 
16236, 16236), class = "Date"), TIME = structure(c(3L, 4L, 1L, 
2L, 5L, 6L), .Label = c("02:02", "11:02", "15:02", "15:12", "15:42", 
"16:02"), class = "factor")), class = "data.frame", .Names = c("DATE", 
"TIME"), row.names = c(NA, -6L))

Upvotes: 1

davechilders
davechilders

Reputation: 9133

IMO, the most readable way:

edited to answer your updated question

library(dplyr)
library(stringr)

df <- date.data %>%
  group_by(
    DATE = as.Date(DATE), 
    HOUR = as.numeric(str_sub(TIME, 1, 2))
    ) %>%
  tally 

# create a data frame with all dates/hours
expand.grid(
  # include all dates from first to last
  DATE = seq.Date(min(df$DATE), max(df$DATE), "day"),
  HOUR = 0:23
) %>% 
  arrange(DATE) %>%
  left_join(df, by = c("DATE", "HOUR"))

Upvotes: 1

TARehman
TARehman

Reputation: 6749

You can do this in a few ways, but I suspect that the easiest way is for you to use table. Using 'table', you can return a frequency on the dates. This is basically just a count of the dates in the data frame.

You can do the same thing after extracting the hour - you can even nest it by doing table(DF$DATE,DF$HOUR). Using as.data.frame would get you a listing somewhat like what you are looking for.

EDITED TO ADD: In response to your edit of the question, you can use a factor level to get the zero levels in the table statement. table respects your factor levels by including them in it's output, even if it is not found in the input (in fact, I believe that table coerces the input into factors on the back side).

Example code:

# Set options and load example data
options(stringsAsFactors = FALSE)
date.data <- data.frame(DATE = c("2014-02-15","2014-02-15","2014-04-15","2014-05-15","2014-06-15","2014-06-15"),
                        TIME = c("15:02","15:12","02:02","11:02","15:42","16:02"))

# Extract the hour
date.data$HOUR <- sapply(X = strsplit(x = date.data$TIME,split = ":"),FUN = `[[`,1)

# Now, set the hours as a factor level - this will allow table() to fill the data in as you are requesting
date.data$HOUR <- factor(x = date.data$HOUR,
                         levels = c("00","01","02","03","04","05",
                                    "06","07","08","09","10","11",
                                    "12","13","14","15","16","17",
                                    "18","19","20","21","22","23"),
                         labels = c("00","01","02","03","04","05",
                                    "06","07","08","09","10","11",
                                    "12","13","14","15","16","17",
                                    "18","19","20","21","22","23"))

# Obtain the first table of interest
as.data.frame(table(date.data$DATE))

        Var1 Freq
1 2014-02-15    2
2 2014-04-15    1
3 2014-05-15    1
4 2014-06-15    2

# And the second table
as.data.frame(table(date.data$DATE,date.data$HOUR))

         Var1 Var2 Freq
1  2014-02-15   00    0
2  2014-04-15   00    0
3  2014-05-15   00    0
4  2014-06-15   00    0
5  2014-02-15   01    0
6  2014-04-15   01    0
7  2014-05-15   01    0
8  2014-06-15   01    0
....

Upvotes: 0

Related Questions