Duncan McArdle

Reputation: 521

Is there a better way to optimise / index this query?

I have a semi-large (10,000,000+ record) credit card transaction database that I need to query regularly. I have managed to optimise most queries to be sub 0.1 seconds but I'm struggling to do the same for sub-queries.

The purpose of the following query is to obtain the number of "inactive" credit cards (credit cards that have not made a card transaction in the last x days / weeks) for both the current user's company, and all companies (so as to form a comparison).

The sub-query first obtains the last card transaction of every credit card, and then the parent query removes any expired credit cards and groups the cards based on their associated company and whether or not they are deemed "inactive" (the (UNIX_TIMESTAMP() - (14 * 86400)) expression is used in place of a PHP time calculation).

SELECT
    SUM(IF(LastActivity < (UNIX_TIMESTAMP() - (14 * 86400)), 1, 0)) AS AllInactiveCards,
    SUM(IF(LastActivity >= (UNIX_TIMESTAMP() - (14 * 86400)), 1, 0)) AS AllActiveCards,
    SUM(IF(LastActivity < (UNIX_TIMESTAMP() - (14 * 86400)) AND lastCardTransactions.CompanyID = 15, 1, 0)) AS CompanyInactiveCards,
    SUM(IF(LastActivity >= (UNIX_TIMESTAMP() - (14 * 86400)) AND lastCardTransactions.CompanyID = 15, 1, 0)) AS CompanyActiveCards
FROM CardTransactions
JOIN
(
    SELECT
        CardSerialNumberID,
        MAX(CardTransactions.Timestamp) AS LastActivity,
        CardTransactions.CompanyID
    FROM CardTransactions
    GROUP BY
        CardTransactions.CardSerialNumberID, CardTransactions.CompanyID
) lastCardTransactions
ON
    CardTransactions.CardSerialNumberID = lastCardTransactions.CardSerialNumberID AND
    CardTransactions.Timestamp = lastCardTransactions.LastActivity AND
    CardTransactions.CardExpiryTimestamp > UNIX_TIMESTAMP()

The indexes in use are on CardSerialNumberID, CompanyID, Timestamp for the inner query, and CardSerialNumberID, Timestamp, CardExpiryTimestamp, CompanyID for the outer query.

The query takes around 0.4 seconds to execute when done multiple times, but the initial run can be as slow as 0.9 - 1.1 seconds, which is a big problem when loading a page with 4-5 of these types of query.

One thought I did have was to calculate the overall inactive card number in a routine separate to this, perhaps run daily. This would allow me to adjust this query to only pull records for a single company, thus reducing the dataset and bringing the query time down. However, this is only really a temporary fix, as the database will continue to grow until the same amount of data is being analysed anyway.

Note: The query above's fields have been modified to make them more generic, as the specific subject this query is used on is quite complex. As such there is no DB schema to give (and if there was, you'd need a dataset of 10,000,000+ records anyway to test the query I suppose). I'm more looking for a conceptual fix than for anyone to actually give me an adjusted query.

Any help is very much appreciated!

Upvotes: 0

Views: 49

Answers (3)

Wilson Hauck

Reputation: 2343

Rather than repetitively calculating the 14-day cutoff and the current timestamp on every line, follow the advice of https://code.tutsplus.com/tutorials/top-20-mysql-best-practices--net-7855 and, prior to the SELECT, use code similar to:

$uts = time();
$uts_14d = $uts - (14 * 86400);

then substitute the $uts_14d and $uts variables into the 5 lines of your query that use them.
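If you would rather keep the calculation on the MySQL side, user variables achieve the same compute-once effect (a sketch only; the values mirror the variables above):

SET @uts := UNIX_TIMESTAMP();
SET @uts_14d := @uts - (14 * 86400);

SELECT
    SUM(IF(LastActivity < @uts_14d, 1, 0)) AS AllInactiveCards,
    ...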

Upvotes: 0

Rick James

Reputation: 142540

Please use different "aliases" for the two instances of Transactions. What you have is confusing to read.

The inner GROUP BY:

SELECT       card_sn, company, MAX(ts) 
    FROM Trans
    GROUP BY card_sn, company

Now this index is good ("covering") for the inner:

INDEX(CardSerialNumberID, CompanyID, Timestamp)

Recommend testing (timing) the subquery by itself.

For the outside query:

INDEX(CardSerialNumberID, Timestamp,  -- for JOINing (prefer this order)
      CardExpiryTimestamp, CompanyID) -- covering (in this order)

Please move CardTransactions.CardExpiryTimestamp > UNIX_TIMESTAMP() to a WHERE clause. It is helpful to the reader that the ON clause contain only the conditions that tie the two tables together. The WHERE contains any additional filtering. (The Optimizer will run this query the same, regardless of where you put that clause.)
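Applied to the query in question, that would look something like this (only the join/filter portion shown; semantics are unchanged):

ON
    CardTransactions.CardSerialNumberID = lastCardTransactions.CardSerialNumberID AND
    CardTransactions.Timestamp = lastCardTransactions.LastActivity
WHERE
    CardTransactions.CardExpiryTimestamp > UNIX_TIMESTAMP()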

Oh. Can that filter be applied in the subquery? It will make the subquery run faster. (It may impact the optimal INDEX, so I await your answer.)

I have assumed that most rows have not "expired". If they have, then other techniques may be better.

For much better performance, look into building and maintaining summary tables of the info. Or, perhaps, rebuild (daily) a table with these stats. Then reference the summary table instead of the raw data.
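A minimal sketch of such a summary table, rebuilt daily (the table and column names here are assumptions; adapt to the real schema):

CREATE TABLE CardLastActivity (
    CompanyID INT NOT NULL,
    CardSerialNumberID INT NOT NULL,
    LastActivity INT NOT NULL,  -- unix timestamp of the card's latest transaction
    PRIMARY KEY (CompanyID, CardSerialNumberID)
);

REPLACE INTO CardLastActivity
SELECT CompanyID, CardSerialNumberID, MAX(Timestamp)
FROM CardTransactions
GROUP BY CompanyID, CardSerialNumberID;

The page's per-company counts then come from this much smaller table instead of the raw transactions.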

If that does not work, consider building a temp table with the "4-5" info at the start of the web page, then feed the rest of the page off that temp table.

Upvotes: 0

user2314737

Reputation: 29427

You're querying the table Transactions twice, so the join is potentially of size Transactions × Transactions, which might be big.

One idea would be to monitor all credit cards for the last x days/weeks and save them in an extra table INACTIVE_CARDS that gets updated every day (add a field with the number of days of inactivity). Then you could limit the SELECT in your subquery to just search in INACTIVE_CARDS:

SELECT
    CardSerialNumberID,
    MAX(Transactions.Timestamp) AS LastActivity,
    Transactions.CompanyID
FROM Transactions
WHERE CardSerialNumberID IN (SELECT CardSerialNumberID FROM INACTIVE_CARDS)
GROUP BY
    Transactions.CardSerialNumberID, Transactions.CompanyID

Of course a card might have become active in the last hour, but you don't need to check all transactions for that.
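A sketch of the daily refresh for such a table (table and column names are assumptions):

DELETE FROM INACTIVE_CARDS;

INSERT INTO INACTIVE_CARDS (CardSerialNumberID, DaysInactive)
SELECT
    CardSerialNumberID,
    FLOOR((UNIX_TIMESTAMP() - MAX(Timestamp)) / 86400)
FROM Transactions
GROUP BY CardSerialNumberID
HAVING MAX(Timestamp) < UNIX_TIMESTAMP() - (14 * 86400);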

Upvotes: 1
