Reputation: 2784
I have this table, attendance (4M rows at the moment, growing 1.2M per week):
--------------------------------------------------------------
| member_id | attendance_week | attendance_date | event_id   |
--------------------------------------------------------------
| INT(10)   | TINYINT(2)      | TIMESTAMP       | TINYINT(3) |
--------------------------------------------------------------
attendance indexes:
---------------------------------------------------
| PRIMARY   (attendance_week, member_id, event_id) |
| member_id (member_id)                            |
| event_id  (event_id, attendance_week)            |
| total     (attendance_week, event_id)            |
---------------------------------------------------
members (400k rows at the moment, growing 750 a week):
---------------------------
| member_id | dept_id     |
---------------------------
| INT(10)   | SMALLINT(5) |
---------------------------
members indexes:
-----------------------
| PRIMARY (member_id) |
-----------------------
Events are weekly, meaning you'll see pairs of member_id and event_id for each week.
Now I have to generate a report for a certain department showing, for each event, the current attendance (i.e. whether a member has already checked in) and their attendance over at least 4 weeks (i.e. events attended / total events over that duration).
This is for the current_attendance part of the report. I fetch all members for a department and LEFT JOIN them with this week's event to get NULL for absences:
SELECT
    m.member_id AS id,
    a.event_id AS attended
FROM
    members AS m
LEFT JOIN
    attendance AS a
ON
    a.member_id = m.member_id AND
    a.attendance_week = :week AND
    a.event_id = :event
WHERE
    m.dept_id = :dept
GROUP BY
    m.member_id
This is for the attended part of the report:
SELECT
    a.member_id,
    COUNT(a.event_id)
FROM
    attendance a
JOIN
    members m
ON
    a.member_id = m.member_id AND
    m.dept_id = :dept
WHERE
    a.attendance_week BETWEEN :start AND :end
GROUP BY
    a.member_id
I could probably merge these two queries by simply LEFT JOINing the attendance table again in the first query.
And finally, for the total part:
SELECT
    attendance_week,
    COUNT(DISTINCT event_id)
FROM
    attendance
WHERE
    attendance_week BETWEEN :start AND :end
GROUP BY
    attendance_week
These are the main queries that will be run on these tables. At the moment the queries run for an average of 150-200ms (according to phpMyAdmin), which I think is slow. EXPLAIN tells me that my indexes are being used.
So here are my questions: I've tried adding SQL_NO_CACHE and I still get the same response time, and query_cache_size is 0. I could swear that I saw phpMyAdmin report the queries at around 800ms once (which is unacceptable), but I don't get that now. How do I measure the true speed of my queries every time they are run?

If you ever need more info, please comment below. I'll try to provide it. I really did try to do this alone, but given the demands of a huge database (my largest so far) and high performance, I really need some advice :D
Thanks
EDIT
I just realized a terrible flaw in my logic: newly registered members will show up with low attendance performance, since the 3rd query doesn't take registration date into account. I have a registration_date column in my members table; is there any way I can incorporate that variable into the query? Or merge all three queries into just one, since they all return values that depend on each member?
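Something like this is what I'm imagining for a per-member total, though I'm not sure it's correct (it assumes registration_date can be mapped to a week number comparable with attendance_week, here via WEEK()):

SELECT
    m.member_id,
    -- events held since this member registered (within the report range)
    (SELECT COUNT(DISTINCT t.attendance_week, t.event_id)
     FROM attendance AS t
     WHERE t.attendance_week
           BETWEEN GREATEST(:start, WEEK(m.registration_date)) AND :end
    ) AS total_events
FROM
    members AS m
WHERE
    m.dept_id = :dept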
EDIT
I've managed to merge the first two queries:
SELECT
    m.member_id AS id,
    a.event_id AS attended,
    COUNT(b.event_id) AS total_attended
FROM
    members AS m
LEFT JOIN
    attendance AS a
ON
    a.member_id = m.member_id AND
    a.attendance_week = :week AND
    a.event_id = :event
LEFT JOIN
    attendance AS b
ON
    b.member_id = m.member_id AND
    b.attendance_week BETWEEN :start AND :end
WHERE
    m.dept_id = :dept
GROUP BY
    m.member_id
This query runs for 925ms on the first run and 15ms on subsequent requests.
This is the result of the above query's EXPLAIN:
members table:
id: 1
select_type: SIMPLE
table: m
type: ref
possible_keys: dept_id
key: dept_id
key_len: 3
ref: const
rows: 88
Extra: Using where; Using index
attendance table 1 (for the boolean attended part):
id: 1
select_type: SIMPLE
table: a
type: eq_ref
possible_keys: PRIMARY,member_id,event_id,total
key: PRIMARY
key_len: 6
ref: const,arms_db.m.member_id,const
rows: 1
Extra: Using index
attendance table 2 (for the total attended part):
id: 1
select_type: SIMPLE
table: b
type: ref
possible_keys: PRIMARY,member_id,total
key: member_id
key_len: 4
ref: arms_db.m.member_id
rows: 5
Extra: Using index
And the EXPLAIN for the last query:
id: 1
select_type: SIMPLE
table: attendance
type: range
possible_keys: PRIMARY,total
key: total
key_len: 2
ref: NULL
rows: 9
Extra: Using where; Using index for group-by
Upvotes: 3
Views: 279
Reputation: 29121
Adding covering or clustered indexes on your tables will give you the best performance.
You can also add an extra index on the members table:
members index: (member_id, dept_id)
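For example (a sketch; the index name is just a suggestion):

ALTER TABLE members
    ADD INDEX member_dept (member_id, dept_id);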
You can enable the Query Cache to cache query output, but note that the Query Cache doesn't work with procedures. To measure the exact speed of your queries you can use the mysqlslap client utility.
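For a quick in-session check you can also use MySQL's built-in profiler (a sketch using one of your queries; available in MySQL 5.x, substitute your real parameters for the literals):

SET profiling = 1;   -- time every statement in this session

SELECT attendance_week, COUNT(DISTINCT event_id)
FROM attendance
WHERE attendance_week BETWEEN 1 AND 4
GROUP BY attendance_week;

SHOW PROFILES;       -- lists each statement with its measured duration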
Queries inside a stored procedure won't make much difference in terms of speed, but they will save some of the additional overhead of query parsing and sending output to the client.
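As a sketch, wrapping your first query in a procedure could look like this (the procedure name and parameter types are illustrative):

DELIMITER //
CREATE PROCEDURE current_attendance(
    IN p_week  TINYINT,
    IN p_event TINYINT,
    IN p_dept  SMALLINT
)
BEGIN
    SELECT m.member_id AS id, a.event_id AS attended
    FROM members AS m
    LEFT JOIN attendance AS a
           ON a.member_id = m.member_id
          AND a.attendance_week = p_week
          AND a.event_id = p_event
    WHERE m.dept_id = p_dept;
END //
DELIMITER ;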
Distributing data over different servers using sharding or replication will help you in terms of scalability. Partitioning huge tables will also benefit you.
Upvotes: 2
Reputation: 22905
Your design seems valid. I think that having reports done within 200ms (even up to 800ms) is perfectly fine for reporting applications. As for the new indexes, I would first check whether they're really worth it, because, say, if all your members are spread equally over only 5 departments, then an index on member.dept_id will not be useful: it is cheaper to perform a full scan in such a case.
I don't see the point of measuring the "true" speed of the queries, as databases are there to speed up data access by effectively caching your data. So if you're in a situation where, on a freshly started DB server, your query takes around 800ms and further executions go down to 50-100ms, then this is a good setup, and this is what I aim for in my daily job.
I doubt it, as stored procedures cost you a small amount of extra time to execute the procedure and obtain its results, compared to the benefit of having all statements parsed by the time the procedure is called.
At the moment your speed is just fine for a non-OLTP application. And it seems to me that partitioning the attendance table by the attendance_week column will give you a nice performance boost, as all your queries revolve around this column. The benefits will become visible once you have more data in the system, though, at least 3-4 weeks' worth of it.
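For example, range partitioning by week might look like this (a sketch; the boundaries are illustrative, and note that the partitioning column must appear in every unique key, which your PRIMARY KEY already satisfies):

ALTER TABLE attendance
PARTITION BY RANGE (attendance_week) (
    PARTITION p_w04 VALUES LESS THAN (5),
    PARTITION p_w08 VALUES LESS THAN (9),
    PARTITION p_w12 VALUES LESS THAN (13),
    PARTITION p_max VALUES LESS THAN MAXVALUE
);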
My assumptions might be wrong for an OLTP system, though. Could you specify the intended usage area of the provided example?
Also, it'd be good to see the actual output of the EXPLAIN statements for your queries.
Upvotes: 0