Counting all rows with specific columns and grouping by week

Question

I've been trying for some time to write a query that counts the rows in a table per day where a column has a certain id, and then groups those counts into weekly values based on the UNIX timestamp column. The dataset is medium-sized, about 37 million rows, and I have been running a query like this:

SELECT DATE(timestamp), COUNT(*)
FROM `table`
WHERE (DATE(timestamp) BETWEEN "YYYY-MM-DD" AND "YYYY-MM-DD" AND column_group_id = X)
GROUP BY WEEK(DATE(startdate))

However, I'm getting odd results: the query doesn't group the counts correctly and shows values that are too large in the count column (I verified the errors by querying very small, specific datasets).

If I group by date(startdate) instead, the row counts match on a per-day basis, but I'd like to combine these daily counts into weekly counts. How can this be done? The data is needed in this format:

2006-01-01 | 5 
2006-01-08 | 10

so that the first column is the day timestamp and the second is the number of rows per week.

GarethD · Accepted Answer

Your query is non-deterministic, so it is not surprising that you are getting unexpected results. By this I mean you could run the query on the same data 5 times and get 5 different result sets. This is because you are selecting DATE(timestamp) but grouping by WEEK(DATE(startdate)); the query therefore returns the timestamp of the first row it comes across for each startdate week, in ANY order.

Consider the following 2 rows (with timestamp in date format for ease of reading):

TimeStamp       StartDate
20120601        20120601
20120701        20120601

Your query groups by WEEK(StartDate), which is 22 for both rows (in MySQL's default week mode), so you would expect your results to have 1 row with a count of 2.

HOWEVER, DATE(Timestamp) is also in the select list, and since it is neither aggregated nor part of the GROUP BY, the query has no way of knowing which Timestamp to return: '20120601' or '20120701'. So even on this small result set you have a 50:50 chance of getting:

TimeStamp       COUNT
20120601        2

and a 50:50 chance of getting

TimeStamp       COUNT
20120701        2

If you add more data to the dataset, like so:

TimeStamp       StartDate
20120601        20120601
20120701        20120601
20120701        20120701

You could get

TimeStamp       COUNT
20120601        2
20120701        1

or

TimeStamp       COUNT
20120701        2
20120701        1

You can see how with 37,000,000 rows you will soon get results that you do not expect and cannot predict!
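
The three-row example can be reproduced with a small sketch. The table name demo and the DATETIME column types are assumptions for illustration, not the asker's actual schema. The first SELECT has the same shape as the original query; on servers running with ONLY_FULL_GROUP_BY enabled it is rejected with an error rather than returning an arbitrary date. The second aggregates the date explicitly, so its result is repeatable:

```sql
-- Hypothetical table mirroring the three-row example above.
CREATE TABLE demo (
    `timestamp` DATETIME NOT NULL,
    startdate   DATETIME NOT NULL
);

INSERT INTO demo (`timestamp`, startdate) VALUES
    ('2012-06-01', '2012-06-01'),
    ('2012-07-01', '2012-06-01'),
    ('2012-07-01', '2012-07-01');

-- Indeterminate: DATE(`timestamp`) is neither aggregated nor a grouping key,
-- so the server may pick the value from any row in each group.
SELECT DATE(`timestamp`), COUNT(*)
FROM demo
GROUP BY WEEK(DATE(startdate));

-- Deterministic: every selected expression is an aggregate or the grouping key.
SELECT WEEK(DATE(startdate))  AS week_no,
       MIN(DATE(`timestamp`)) AS first_day,
       COUNT(*)               AS weekly_rows
FROM demo
GROUP BY week_no;
```

With these rows, the second query returns one row per startdate week, with counts of 2 and 1 and the earliest timestamp date in each week.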

EDIT

Since it looks like you are trying to get the week start in your results while grouping by week, you could use the following to get the week start (replacing CURRENT_TIMESTAMP with whichever column you want):

SELECT  DATE_ADD(DATE(CURRENT_TIMESTAMP), INTERVAL 1 - DAYOFWEEK(CURRENT_TIMESTAMP) DAY) AS WeekStart

You can then group by this date too to get weekly results and avoid the trouble of having things in your select list that aren't in your group by.
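
Putting this together with the filter from the question, a weekly count might look like the sketch below. It assumes the week should be derived from the `timestamp` column (the question filters on timestamp but groups on startdate, so swap in whichever column is intended); the date bounds and the id value 1 are placeholders.

```sql
-- Names `table`, `timestamp` and column_group_id come from the question;
-- the date range and the id value 1 are illustrative placeholders.
SELECT DATE_ADD(DATE(`timestamp`),
                INTERVAL 1 - DAYOFWEEK(`timestamp`) DAY) AS WeekStart,
       COUNT(*) AS weekly_rows
FROM `table`
WHERE DATE(`timestamp`) BETWEEN '2006-01-01' AND '2006-01-31'
  AND column_group_id = 1
GROUP BY WeekStart      -- MySQL allows grouping by a select-list alias
ORDER BY WeekStart;
```

DAYOFWEEK() returns 1 for Sunday, so each WeekStart is the Sunday on or before the row's date, which matches the sample output in the question (2006-01-01 and 2006-01-08 are both Sundays). Wrapping the column in DATE() drops the time part so that all rows in a week share one grouping value. Grouping by the week-start date also keeps weeks from different years apart, which WEEK() alone would not.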
