Reputation: 1833
I was able to select last 150 days from database when having column 'year' as follow:
data1 = dbGetQuery(conn_data, statement=paste("SELECT *, STR_TO_DATE(CONCAT(yyyy,'-',mm,'-',dd),'%Y-%m-%d') as dt FROM stations_daily_data", "WHERE STR_TO_DATE(CONCAT(yyyy,'-',mm,'-',dd),'%Y-%m-%d') >= DATE_SUB(CURDATE(), INTERVAL 150 DAY)"))
But now all data were averaged to date and thus only have columns 'month' and 'day' (no column 'year'), and I was stuck in how to select last 150 days this time. Here is the simplified example of data frame with original one of 17 million rows:
df <- data.frame(ID=c(1:5,50001:50005),mm=c(rep(1,5),rep(12,5)),dd=c(1:5,27:31),value=c(21:30))
Feb 29 can be ignored since 150 days is a significant amount of time period.
I tried add column 'year' so that I could use the code above, but it would be wrong if say, current date is at the beginning of a year, also make changes to a big table in R would run out of R memory, I'm not familiar with database query, is it possible that I can do this by just using query instead of read the table into R and then make changes in the data frame in R, any suggestion would be appreciated!
EDIT: The column 'year' is no longer needed since its all been averaged to date, which means now May 5th would be the average of 60 years of May 5th of each year. Next I would like to select last 150 days(averaged), the reason I tried to add column 'year' was simply try to make it easier to select.
Since I need to run the data every day, so if the day is after the month of June it would be easy just to use the current year, but if it's the month of February, then it would be current year-1, this could be done if the data is much smaller, now if I make change to the data frame, the R would pop out error of 'out of memory', that's why I was wondering if there is a way to select in database query or functions in R that wouldn't cost much memory, thanks!
Upvotes: 0
Views: 290
Reputation: 9582
You could write a function to calculate year based on a reference year plus an adjustment based on a cut off month. Then you could use the order
function to order the data.frame based on calculated year, month, and day, without inserting the new calculated year field into the data.frame.
Won't have a great performance on 17 million row dataset though, since you are still ordering every row.
# some dummy data (not worrying about illegal dates like Feb 31)
set.seed(123)
da <- data.frame(mm=sample(1:12, 20, replace=T),
dd=sample(1:31, 20, replace=T))
# function to calculate year from reference year and cut off month
calc_year <- function(mm_vec, ref_year, cut_month) {
ref_year + ifelse(mm_vec >= cut_month, 0, -1)
}
# order the data.frame by year, month, and day
# (taking 2014 as ref. year & assuming months before June are from prior year
da[with(da, order(calc_year(mm_vec=mm, ref_year=2014, cut_month=6), mm, dd)), ]
# if you want just the first 5 rows
da[with(da, order(calc_year(mm_vec=mm, ref_year=2014, cut_month=6), mm, dd)), ][1:5,]
Upvotes: 0