Reputation: 521
I have a semi-large (10,000,000+ record) credit card transaction database that I need to query regularly. I have managed to optimise most queries to run in under 0.1 seconds, but I'm struggling to do the same for sub-queries.
The purpose of the following query is to obtain the number of "inactive" credit cards (credit cards that have not made a card transaction in the last x days / weeks) for both the current user's company, and all companies (so as to form a comparison).
The sub-query first obtains the last card transaction of every credit card, and then the parent query removes any expired credit cards and groups the cards based on their associated company and on whether or not they are deemed "inactive" (the expression UNIX_TIMESTAMP() - (14 * 86400) is used in place of a PHP time calculation).
SELECT
    SUM(IF(LastActivity < (UNIX_TIMESTAMP() - (14 * 86400)), 1, 0)) AS AllInactiveCards,
    SUM(IF(LastActivity >= (UNIX_TIMESTAMP() - (14 * 86400)), 1, 0)) AS AllActiveCards,
    SUM(IF(LastActivity < (UNIX_TIMESTAMP() - (14 * 86400)) AND lastCardTransactions.CompanyID = 15, 1, 0)) AS CompanyInactiveCards,
    SUM(IF(LastActivity >= (UNIX_TIMESTAMP() - (14 * 86400)) AND lastCardTransactions.CompanyID = 15, 1, 0)) AS CompanyActiveCards
FROM CardTransactions
JOIN
(
    SELECT
        CardSerialNumberID,
        MAX(CardTransactions.Timestamp) AS LastActivity,
        CardTransactions.CompanyID
    FROM CardTransactions
    GROUP BY
        CardTransactions.CardSerialNumberID, CardTransactions.CompanyID
) lastCardTransactions
ON
    CardTransactions.CardSerialNumberID = lastCardTransactions.CardSerialNumberID AND
    CardTransactions.Timestamp = lastCardTransactions.LastActivity AND
    CardTransactions.CardExpiryTimestamp > UNIX_TIMESTAMP()
The indexes in use are (CardSerialNumberID, CompanyID, Timestamp) for the inner query, and (CardSerialNumberID, Timestamp, CardExpiryTimestamp, CompanyID) for the outer query.
The query takes around 0.4 seconds to execute when done multiple times, but the initial run can be as slow as 0.9 - 1.1 seconds, which is a big problem when loading a page with 4-5 of these types of query.
One thought I did have was to calculate the overall inactive card number in a routine separate to this, perhaps run daily. This would allow me to adjust this query to only pull records for a single company, thus reducing the dataset and bringing the query time down. However, this is only really a temporary fix, as the database will continue to grow until the same amount of data is being analysed anyway.
Note: the field names in the query above have been made more generic, as the specific subject this query is used on is quite complex. As such there is no DB schema to give (and if there were, you'd need a dataset of 10,000,000+ records to test the query anyway). I'm looking more for a conceptual fix than for anyone to actually give me an adjusted query.
Any help is very much appreciated!
Upvotes: 0
Views: 49
Reputation: 2343
Rather than repetitively calculating "14 days ago" and the current unix timestamp inside the query, follow the advice of https://code.tutsplus.com/tutorials/top-20-mysql-best-practices--net-7855 and, prior to the SELECT, compute both values once in PHP:
$uts = time();                  // current unix timestamp
$uts_14d = $uts - (14 * 86400); // 14 days ago
and substitute the $uts and $uts_14d variables into the five lines of your query that use those expressions.
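The same idea also works entirely inside MySQL using user-defined variables, if you'd rather not involve PHP; a sketch (variable names are my own):

```sql
-- Compute the timestamps once, then reuse them throughout the query.
SET @uts     := UNIX_TIMESTAMP();
SET @uts_14d := @uts - (14 * 86400);

SELECT
    SUM(IF(LastActivity <  @uts_14d, 1, 0)) AS AllInactiveCards,
    SUM(IF(LastActivity >= @uts_14d, 1, 0)) AS AllActiveCards
    -- ... rest of the query unchanged, with @uts replacing UNIX_TIMESTAMP()
FROM CardTransactions
-- ...
```

Either way the saving is mostly readability; MySQL evaluates UNIX_TIMESTAMP() cheaply, so don't expect this alone to fix the 0.9-second runs.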
Upvotes: 0
Reputation: 142540
Please use different aliases for the two instances of CardTransactions. What you have is confusing to read.
The inner GROUP BY:
SELECT card_sn, company, MAX(ts)
FROM Trans
GROUP BY card_sn, company
Now this index is good ("covering") for the inner:
INDEX(CardSerialNumberID, CompanyID, Timestamp)
Recommend testing (timing) the subquery by itself.
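For example (the index name here is illustrative), the covering index can be added and the subquery checked on its own:

```sql
ALTER TABLE CardTransactions
    ADD INDEX idx_card_company_ts (CardSerialNumberID, CompanyID, Timestamp);

-- Time the subquery in isolation; EXPLAIN should report "Using index"
-- in the Extra column, confirming the index covers the query.
EXPLAIN
SELECT CardSerialNumberID, CompanyID, MAX(Timestamp)
FROM CardTransactions
GROUP BY CardSerialNumberID, CompanyID;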
For the outside query:
INDEX(CardSerialNumberID, Timestamp, -- for JOINing (prefer this order)
CardExpiryTimestamp, CompanyID) -- covering (in this order)
Please move CardTransactions.CardExpiryTimestamp > UNIX_TIMESTAMP() to a WHERE clause. It is helpful to the reader for the ON clause to contain only the conditions that tie the two tables together; the WHERE clause contains any additional filtering. (The Optimizer will run this query the same regardless of where you put that condition.)
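That is, the join conditions stay in ON and the expiry filter moves down; a sketch of the rearranged query (the aliases are my own):

```sql
FROM CardTransactions AS t
JOIN ( ... ) AS last              -- the existing GROUP BY subquery
    ON  t.CardSerialNumberID = last.CardSerialNumberID
    AND t.Timestamp          = last.LastActivity
WHERE t.CardExpiryTimestamp > UNIX_TIMESTAMP()
```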
Oh. Can that filter be applied in the subquery? It would make the subquery run faster. (It may impact the optimal INDEX, so I await your answer.)
I have assumed that most rows have not "expired". If they have, then other techniques may be better.
For much better performance, look into building and maintaining summary tables of the info. Or, perhaps, rebuild (daily) a table with these stats. Then reference the summary table instead of the raw data.
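A minimal sketch of the daily-rebuild flavour of this (the summary table name and columns are assumptions, not something from your schema):

```sql
-- Rebuilt once a day, e.g. from a cron job or a MySQL EVENT:
DROP TABLE IF EXISTS CardActivitySummary;
CREATE TABLE CardActivitySummary AS
SELECT
    CompanyID,
    SUM(IF(LastActivity <  UNIX_TIMESTAMP() - (14 * 86400), 1, 0)) AS InactiveCards,
    SUM(IF(LastActivity >= UNIX_TIMESTAMP() - (14 * 86400), 1, 0)) AS ActiveCards
FROM ( SELECT CardSerialNumberID, CompanyID, MAX(Timestamp) AS LastActivity
       FROM CardTransactions
       GROUP BY CardSerialNumberID, CompanyID ) AS lastActivityPerCard
GROUP BY CompanyID;
```

The page then reads its per-company and overall counts from CardActivitySummary instead of scanning 10M+ raw transactions on every load.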
If that does not work, consider building a temp table with the "4-5" info at the start of the web page, then feed the page off that temp table.
Upvotes: 0
Reputation: 29427
You're querying the table Transactions two times, so your join has a potential size of Transactions x Transactions, which might be big.
One idea would be to monitor all credit cards for the last x days/weeks and save them in an extra table INACTIVE_CARDS that gets updated every day (add a field with the number of days of inactivity). Then you could limit the SELECT in your subquery to just search in INACTIVE_CARDS
SELECT
    CardSerialNumberID,
    MAX(Transactions.Timestamp) AS LastActivity,
    Transactions.CompanyID
FROM Transactions
WHERE CardSerialNumberID IN (SELECT CardSerialNumberID FROM INACTIVE_CARDS)
GROUP BY
    Transactions.CardSerialNumberID, Transactions.CompanyID
Of course a card might have become active in the last hour, but you don't need to check all transactions for that.
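A sketch of how INACTIVE_CARDS might be maintained daily (the schema and the truncate-and-reload strategy are assumptions; an incremental update would also work):

```sql
CREATE TABLE INACTIVE_CARDS (
    CardSerialNumberID INT NOT NULL PRIMARY KEY,
    DaysInactive       INT NOT NULL  -- number of days since the last transaction
);

-- Rebuilt once a day:
TRUNCATE TABLE INACTIVE_CARDS;
INSERT INTO INACTIVE_CARDS
SELECT CardSerialNumberID,
       FLOOR((UNIX_TIMESTAMP() - MAX(Timestamp)) / 86400)
FROM Transactions
GROUP BY CardSerialNumberID
HAVING MAX(Timestamp) < UNIX_TIMESTAMP() - (14 * 86400);
```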
Upvotes: 1