J fast
J fast

Reputation: 53

How to summarize by Quarter in R

I am having some difficulties on summarizing data from my database in R. I am looking to pull the data and have it summarized by Quarter.

Below is the code i am using to get a txt output but I am getting errors.

What do I need to do to manipulate the code to run this so that I can have the data be summarized by quarter?

library(data.table, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

################
## PARAMETERS ##
################

# Set path of major source folder for raw transaction data
in_directory <- "C:/Users/name/Documents/Raw Data/"

# List names of sub-folders (currently grouped by first two characters of 
CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ", "IA-IL", "IM-KZ", "LA-MI", "MJ-MS",
                  "MT-NV", "NW-OH", "OI-PZ", "QA-TN", "TO-UZ",
                  "VA-WA", "WB-ZZ")

# Set location for output
out_directory <- "C:/Users/name/Documents/YTD Master/"
out_filename <- "NEW.csv"

# Set beginning and end of date range to be collected - year-month-day format
date_range <- interval(as.Date("2018-01-01"), as.Date("2018-05-31"))

# Enable or disable filtering of raw files to only grab items bought within 
certain months to save space.
# If false, all files will be scanned for unique items, which will take 
longer and be a larger file.
date_filter <- TRUE


##########
## CODE ##
##########

starttime <- Sys.time()
mastertable <- NULL

for (j in 1:length(in_subfolders)) {
  subfolder <- in_subfolders[j]
  sub_directory <- paste0(in_directory, subfolder, "/")

  ## IMPORT DATA
  in_filenames <- dir(sub_directory, pattern =".txt")

  for (i in 1:length(in_filenames)) {

    # Default value provided for when fast filtering is disabled.
    read_this_file <- TRUE

    # To fast filter the data, we choose to include or exclude an entire file 
based on the date of its first line.
    # WARNING: This is only a valid method if filtering by entire months, 
since that is the amount of data housed in each file.
    if (date_filter) {
      temptable <- fread(paste0(sub_directory, in_filenames[i]), 
colClasses=c(CUSTOMER_TIER = "character"),
                     na.strings = "", nrows = 1)
      temptable[, INVOICE_DT := as.Date(INVOICE_DT)]

      # If date matches, set read flag to TRUE.  If date does not match, set 
read flag to FALSE.
  read_this_file <- temptable[, INVOICE_DT] %within% date_range
}


if (read_this_file) {
  print(Sys.time()-starttime)
  print(paste0("Reading in ", in_filenames[i]))
  temptable <- fread(paste0(sub_directory, in_filenames[i]), colClasses=c(CUSTOMER_TIER = "character"),
                     na.strings = "")
  temptable <- temptable[, lapply(.SD, sum), by = quarter(INVOICE_DT),
                         .SDcols = c("INV_ITEM_ID","Ext Sale", "Ext Total Cost", "CE100", "CE110","CE120","QTY_SOLD","PACKSLIP_WHSL")]

  # Combine into full list
  mastertable <- rbindlist(list(mastertable, temptable), use.names = TRUE)
  # Release unneeded memory
  rm(temptable)

}

 }

}

# Save Final table
print("Saving master table")
fwrite(mastertable, paste0(out_directory, out_filename))
rm(mastertable)

print(Sys.time()-starttime)

After running this scrip the below is the error message i receive.

Error in gsum(INV_ITEM_ID) : Type 'character' not supported by GForce sum (gsum). Either add the prefix base::sum(.) or turn off GForce optimization using options(datatable.optimize=1)

Upvotes: 0

Views: 2161

Answers (1)

Puddlebunk
Puddlebunk

Reputation: 493

Here is the general approach with some generic data.

library(tidyverse)
library(lubridate)
data.frame(date = seq(as.Date('2010-01-12'), as.Date('2018-02-03'), by = 100),
                 var = runif(30)) %>%
  group_by(quarter(date, with_year = T)) %>%
  summarize(average_var = mean(var))

you can leave out the "with_year = T" if you don't care about the differences between years.

Upvotes: 1

Related Questions