Reputation: 53
I am having some difficulties on summarizing data from my database in R. I am looking to pull the data and have it summarized by Quarter.
Below is the code i am using to get a txt output but I am getting errors.
What do I need to do to manipulate the code to run this so that I can have the data be summarized by quarter?
library(data.table, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
################
## PARAMETERS ##
################
# Set path of major source folder for raw transaction data
in_directory <- "C:/Users/name/Documents/Raw Data/"
# List names of sub-folders (currently grouped by first two characters of
CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ", "IA-IL", "IM-KZ", "LA-MI", "MJ-MS",
"MT-NV", "NW-OH", "OI-PZ", "QA-TN", "TO-UZ",
"VA-WA", "WB-ZZ")
# Set location for output
out_directory <- "C:/Users/name/Documents/YTD Master/"
out_filename <- "NEW.csv"
# Set beginning and end of date range to be collected - year-month-day format
date_range <- interval(as.Date("2018-01-01"), as.Date("2018-05-31"))
# Enable or disable filtering of raw files to only grab items bought within
certain months to save space.
# If false, all files will be scanned for unique items, which will take
longer and be a larger file.
date_filter <- TRUE
##########
## CODE ##
##########
starttime <- Sys.time()
mastertable <- NULL
for (j in 1:length(in_subfolders)) {
subfolder <- in_subfolders[j]
sub_directory <- paste0(in_directory, subfolder, "/")
## IMPORT DATA
in_filenames <- dir(sub_directory, pattern =".txt")
for (i in 1:length(in_filenames)) {
# Default value provided for when fast filtering is disabled.
read_this_file <- TRUE
# To fast filter the data, we choose to include or exclude an entire file
based on the date of its first line.
# WARNING: This is only a valid method if filtering by entire months,
since that is the amount of data housed in each file.
if (date_filter) {
temptable <- fread(paste0(sub_directory, in_filenames[i]),
colClasses=c(CUSTOMER_TIER = "character"),
na.strings = "", nrows = 1)
temptable[, INVOICE_DT := as.Date(INVOICE_DT)]
# If date matches, set read flag to TRUE. If date does not match, set
read flag to FALSE.
read_this_file <- temptable[, INVOICE_DT] %within% date_range
}
if (read_this_file) {
print(Sys.time()-starttime)
print(paste0("Reading in ", in_filenames[i]))
temptable <- fread(paste0(sub_directory, in_filenames[i]), colClasses=c(CUSTOMER_TIER = "character"),
na.strings = "")
temptable <- temptable[, lapply(.SD, sum), by = quarter(INVOICE_DT),
.SDcols = c("INV_ITEM_ID","Ext Sale", "Ext Total Cost", "CE100", "CE110","CE120","QTY_SOLD","PACKSLIP_WHSL")]
# Combine into full list
mastertable <- rbindlist(list(mastertable, temptable), use.names = TRUE)
# Release unneeded memory
rm(temptable)
}
}
}
# Save Final table
print("Saving master table")
fwrite(mastertable, paste0(out_directory, out_filename))
rm(mastertable)
print(Sys.time()-starttime)
After running this scrip the below is the error message i receive.
Error in gsum(INV_ITEM_ID) : Type 'character' not supported by GForce sum (gsum). Either add the prefix base::sum(.) or turn off GForce optimization using options(datatable.optimize=1)
Upvotes: 0
Views: 2161
Reputation: 493
Here is the general approach with some generic data.
library(tidyverse)
library(lubridate)
data.frame(date = seq(as.Date('2010-01-12'), as.Date('2018-02-03'), by = 100),
var = runif(30)) %>%
group_by(quarter(date, with_year = T)) %>%
summarize(average_var = mean(var))
you can leave out the "with_year = T" if you don't care about the differences between years.
Upvotes: 1