Adam Copley

Reputation: 1495

JOIN or 2 queries - 1 large table, 1 small, hardware limited

I have a page with a <select> menu containing all of the values from a small table (229 rows), rendered as <option value='KEY'>VALUE</option>.

This select menu is a filter for a query which runs on a large table (3.5M rows). The large table contains a foreign key which references KEY in the small table.

However, in the results of the large-table query, I also need to display the corresponding VALUE from the small table.

I could quite easily do an INNER JOIN to retrieve the results, OR I could run a separate 'pre'-query against the smaller table, fetch its values into an array, and let the application map each KEY in the large-table results to its VALUE.

The application is written in PHP.

Hardware resources ARE an issue (cannot upgrade to a higher instance right now; boss-constrained) - I am running this on a t2.micro RDS instance on Amazon Web Services. I have added both single and covering indexes on the columns in the WHERE & HAVING clauses, and my server is reporting that I have 46 MB of RAM available.

Given the above, I know that JOIN can be expensive especially on big tables. Does it just make sense here to do 2 queries, and let the application handle some of the work, until I can negotiate better resources?
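To illustrate the two-query approach (a sketch only; the table and column names follow the schema in the EDIT below, and the $pdo connection is assumed):

// Query 1: load the small table (229 rows) into a KEY => VALUE array.
$lookup = $pdo->query('SELECT nationality_id, nationality FROM 4_nationality')
              ->fetchAll(PDO::FETCH_KEY_PAIR);

// Query 2: aggregate the large table on its own, with no join.
$stmt = $pdo->prepare(
    'SELECT nationality_id, COUNT(DISTINCT txn_id) AS numtrans,
            SUM(sales) AS sales, SUM(units) AS units, YrQtr
       FROM 1_txns
      WHERE YrQtr LIKE :period
      GROUP BY nationality_id'
);
$stmt->execute(['period' => $period]);

// Resolve KEY -> VALUE in PHP instead of in MySQL.
foreach ($stmt as $row) {
    $row['nationality'] = $lookup[$row['nationality_id']] ?? null;
    // ... render the row ...
}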

EDIT:

No join: 6.9 sec

SELECT nationality_id, COUNT(DISTINCT(txn_id)) as numtrans,
        SUM(sales) as sales, SUM(units) as units, YrQtr
FROM 1_txns
 GROUP BY nationality_id;

EXPLAIN
'1', 'SIMPLE', '1_txns', 'index', 'covering,nat', 'nat', '5', NULL, '3141206', NULL

With join: 59.03 sec

SELECT 4_nationality.nationality, COUNT(DISTINCT(txn_id)) as numtrans,
        SUM(sales) as sales, SUM(units) as units, YrQtr
FROM 1_txns INNER JOIN 4_nationality USING (nationality_id)
 GROUP BY nationality_id
 HAVING YrQtr LIKE :period;
EXPLAIN
'1', 'SIMPLE', '4_nationality', 'ALL', 'PRIMARY', NULL, NULL, NULL, '229', 'Using temporary; Using filesort'
'1', 'SIMPLE', '1_txns', 'ref', 'covering,nat', 'nat', '5', 'reports.4_nationality.nationality_id', '7932', NULL

Schema is

Table 1_txns (txn_id, nationality_id, yrqtr, sales, units)
Table 4_nationality (nationality_id, nationality)

I have separate indexes on nationality_id, txn_id, and yrqtr in my large transactions table, and just a primary-key index on my small table.
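One way those indexes might be declared (a sketch; the index names nat and covering are taken from the EXPLAIN output above, but the exact column list of the covering index is an assumption):

ALTER TABLE 1_txns
  ADD INDEX nat (nationality_id),
  ADD INDEX covering (nationality_id, yrqtr, txn_id, sales, units);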

Something strange, too: the query WITHOUT the join is missing a row from its results!

Upvotes: 0

Views: 55

Answers (2)

Rick James

Reputation: 142208

First, move the HAVING to WHERE so that the rest of the query has less to do. Second, delay the lookup of nationality until after the GROUP BY:

SELECT  
        ( SELECT  nationality
            FROM  4_nationality
            WHERE  nationality_id = t.nationality_id 
        ) AS nationality,
        COUNT(DISTINCT(txn_id)) as numtrans,
        SUM(sales) as sales,
        SUM(units) as units,
        YrQtr
    FROM  1_txns AS t
    WHERE  YrQtr LIKE :period
    GROUP BY  nationality_id;

If possible, avoid wild cards and simply do YrQtr = :period. That would allow INDEX(YrQtr, nationality_id) for even more performance.
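A minimal sketch of that variant (the index name yrqtr_nat is an assumption):

ALTER TABLE 1_txns ADD INDEX yrqtr_nat (YrQtr, nationality_id);

SELECT  
        ( SELECT  nationality
            FROM  4_nationality
            WHERE  nationality_id = t.nationality_id 
        ) AS nationality,
        COUNT(DISTINCT(txn_id)) as numtrans,
        SUM(sales) as sales,
        SUM(units) as units,
        YrQtr
    FROM  1_txns AS t
    WHERE  YrQtr = :period    -- equality instead of LIKE, so the index can be used fully
    GROUP BY  nationality_id;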

Upvotes: 0

DRapp

Reputation: 48139

If your lookup "menu" table is only the 229 rows as stated, has a unique key, and carries an index on (key, value), the join should be negligible... especially if you're only querying the results based on a single key anyhow.

The bigger question to me would be your table of 3.5 million records. Across 229 "menu" items, that averages out to over 15k records per item. And I am sure not every category is evenly balanced... some could have a few hundred or a few thousand entries, others 30k+. Is there some other criterion that would allow smaller subsets to be returned? Obviously there is not enough info here to quantify that.

Now, after seeing your revised post while entering this, I see you are trying to get aggregations, and the table is otherwise fixed historical data. I would suggest building a summary table on a per Nationality / YrQtr basis. That way you can query it directly whenever the period is PRIOR to the current period, and only sum aggregates from production for the current period. Since historical transactions won't change, neither will their counts, and you would get an immediate response from the pre-summarized table.

Feedback

As for how and when to implement a summary table: I would create the table with the respective columns you need... Nationality, Period (Yr/Qtr), and the respective counts for distinct transactions, etc.

I would then pre-aggregate once over all your existing data, for everything UP TO but not including the current period (Yr/Qtr). Now you have your baseline established in the summary table.
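A minimal sketch of that baseline (the table and column names are assumptions, chosen to match the pseudocode below; the primary key is what keeps one row per Nationality / YrQtr):

CREATE TABLE SummaryTable (
  nationality_id INT           NOT NULL,
  YrQtr          VARCHAR(8)    NOT NULL,  -- type assumed; should match 1_txns.yrqtr
  NumTrans       INT           NOT NULL,
  TotSales       DECIMAL(14,2) NOT NULL,  -- type assumed; should match 1_txns.sales
  TotUnits       INT           NOT NULL,
  PRIMARY KEY (nationality_id, YrQtr)
);

INSERT INTO SummaryTable (nationality_id, YrQtr, NumTrans, TotSales, TotUnits)
SELECT nationality_id, yrqtr,
       COUNT(DISTINCT txn_id), SUM(sales), SUM(units)
  FROM 1_txns
 WHERE yrqtr < :currentPeriod        -- everything before the current period
 GROUP BY nationality_id, yrqtr;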

Then add an AFTER INSERT trigger to your transaction table that does something like the following (AND NOTE, THIS IS NOT AN ACTUAL TRIGGER, BUT THE CONTEXT OF WHAT TO DO - a concrete MySQL version follows the pseudocode):

update summaryTable
   set numTrans = numTrans + 1,
       TotSales = TotSales + NEWENTRY.Sales,
       TotUnits = TotUnits + NEWENTRY.Units
   where
           Nationality = NEWENTRY.Nationality
       AND YrQtr = NEWENTRY.YrQtr

if # records affected by the update = 0
   Insert into SummaryTable 
      ( Nationality, 
        YrQtr, 
        NumTrans, 
        TotSales, 
        TotUnits )
     values
     (  NEWENTRY.Nationality,
        NEWENTRY.YrQtr,
        1,
        NEWENTRY.Sales,
        NEWENTRY.Units )
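In actual MySQL, that update-then-insert pair can collapse into a single upsert inside the trigger. A sketch, assuming the SummaryTable definition above with its unique key on (nationality_id, YrQtr):

DELIMITER $$
CREATE TRIGGER trg_txns_summary
AFTER INSERT ON 1_txns
FOR EACH ROW
BEGIN
  -- The unique key turns "update, else insert" into one statement.
  INSERT INTO SummaryTable (nationality_id, YrQtr, NumTrans, TotSales, TotUnits)
  VALUES (NEW.nationality_id, NEW.yrqtr, 1, NEW.sales, NEW.units)
  ON DUPLICATE KEY UPDATE
    NumTrans = NumTrans + 1,        -- note: counts rows, not DISTINCT txn_id
    TotSales = TotSales + NEW.sales,
    TotUnits = TotUnits + NEW.units;
END$$
DELIMITER ;

One caveat: NumTrans + 1 counts inserted rows; if a single txn_id can span several rows, that differs from the COUNT(DISTINCT txn_id) in your original query.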

Now your aggregates will ALWAYS be in sync in the summary table after EVERY record inserted into the transaction table, and you can ALWAYS query the summary table instead of the full transaction table. If there is never any activity for a given Nationality / YrQtr, no record will exist there.
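The page's filtered query then runs against a couple hundred rows per period instead of 3.5M (a sketch, again using the assumed SummaryTable names):

SELECT n.nationality, s.NumTrans, s.TotSales, s.TotUnits, s.YrQtr
  FROM SummaryTable AS s
  INNER JOIN 4_nationality AS n USING (nationality_id)
 WHERE s.YrQtr = :period;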

Upvotes: 1
