Reputation: 31
I have a table that currently contains about 5 million rows. This is a live database, populated continuously by a scraping script. For example: a business listing site returns a JSON response to an API call; the response is parsed and inserted into the database, with a duplicate check in between. At a later phase I use the stored data to generate reports.
While generating reports from the stored information, the script takes too long to execute. The scraping script is live and will keep adding records; I expect 0.7-1 million new rows every month.
Here is the structure of my table:
CREATE TABLE IF NOT EXISTS `biz_listing` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`lid` smallint(11) NOT NULL,
`name` varchar(300) NOT NULL,
`type` enum('cat1','cat2') NOT NULL,
`location` varchar(300) NOT NULL,
`businessID` varchar(300) NOT NULL,
`reviewcount` int(6) NOT NULL,
`city` varchar(300) NOT NULL,
`categories` varchar(300) NOT NULL,
`result_month` varchar(10) NOT NULL,
`updated_date` date NOT NULL,
PRIMARY KEY (`id`),
KEY `biz_date` (`businessID`,`updated_date`),
KEY `type_date` (`type`,`updated_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The records fall under two categories, 'cat1' and 'cat2'. (I am planning to add a new category, say 'cat3'.)
I need a same-station aggregate report section, which shows the business IDs that appear in every month of a selected range of months.
Here the chosen range is June-July 2014.
Report on aggregate numbers by category
SELECT COUNT(t.`businessID`) AS bizcount, SUM(t.reviewcount) AS reviewcount, t.`type`
FROM `biz_listing` t
INNER JOIN
  ( SELECT `businessID`, COUNT(*) c
    FROM `biz_listing`
    WHERE updated_date BETWEEN '2014/06/01' AND LAST_DAY('2014/07/01')
    GROUP BY `businessID`, `type`
    HAVING c = 2 ) t2
ON t2.`businessID` = t.`businessID`
WHERE updated_date BETWEEN '2014/07/01' AND LAST_DAY('2014/07/01')
GROUP BY t.`type`
EXPLAIN (run on a backup table of 4 million rows):
Report on aggregate numbers by city
SELECT COUNT(t.`businessID`) AS bizcount, SUM(t.reviewcount) AS reviewcount, t.`type`, t.`location` AS city
FROM `biz_listing` t
INNER JOIN
  ( SELECT `businessID`, COUNT(*) c
    FROM `biz_listing`
    WHERE updated_date BETWEEN '2014/06/01' AND LAST_DAY('2014/07/01')
    GROUP BY `businessID`, `type`
    HAVING c = 2 ) t2
ON t2.`businessID` = t.`businessID`
WHERE updated_date BETWEEN '2014/07/01' AND LAST_DAY('2014/07/01')
GROUP BY t.`location`, t.`result_month`
Here we select a range of months (June-July), so the join lists all the businessIDs common to both months.
The first query reports totals by business type; the second reports them by location.
The problem is that the queries take very long to execute (600 seconds and more), and sometimes they die before completing.
Please suggest any optimizations you see for the queries.
I also suspect that indexing hurts the insertion performance of the scraping script. How should I modify the current setup, considering both insertion and retrieval performance?
Thanks in advance.
EDIT
I tried the suggested covering indexes, and it's taking much more time than usual :(
The EXPLAIN is as follows:
Upvotes: 2
Views: 2240
Reputation: 108651
This is a MyISAM table, which offers less contention between inserting queries and reporting queries than InnoDB. Therefore, let's focus first on the reporting queries. It is true that indexes slow down inserts. But queries slow down a LOT because of missing or incorrect indexes.
To troubleshoot this performance problem, I believe it helps to consider the various subqueries separately.
So let's start with one of them.
SELECT `businessID`,
count(*) c
FROM `biz_listing`
WHERE updated_date BETWEEN '2014/06/01' AND LAST_DAY('2014/07/01')
GROUP BY `businessID`,`type`
HAVING c = 2
This subquery is straightforward, and basically well-constructed. It's capable of using an index to jump to the first record meeting the updated_date range criterion, then scanning that index linearly up to the last such record. If the index also contains the type column, the subquery can collect the record counts it needs directly from the index as it scans. That's fast.
But, you don't have that index! So this subquery is doing a full table scan. As we say in New England, that's wicked slow.
If you took your compound covering index (type, updated_date) and exchanged the order of its two columns, giving (updated_date, type), it would serve as a high-performance covering index for this query. As it stands, the column order in your compound index makes it unhelpful here.
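In SQL, that swap might look like the following (the index name is illustrative, and it's worth trying on the backup table first, since rebuilding an index on millions of MyISAM rows takes a while):

```sql
-- Hypothetical statement: replaces the (type, updated_date) index
-- with one that leads on updated_date, so the range scan can use it.
ALTER TABLE `biz_listing`
  DROP INDEX `type_date`,
  ADD INDEX `date_type` (`updated_date`, `type`);
```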
Let's take a look at your first main query in the same light (omitting the subquery).
SELECT COUNT(t.`businessID`) AS bizcount,
SUM(t.reviewcount) AS reviewcount, t.`type`
FROM `biz_listing` t
WHERE updated_date BETWEEN '2014/07/01' AND LAST_DAY('2014/07/01')
GROUP BY t.`type`
(Something's not clear here. You say COUNT(t.businessID), but it's possible you want COUNT(DISTINCT t.businessID). What you have gives the same result as COUNT(*), because businessID has no NULL values. If you do use DISTINCT, you can put HAVING COUNT(DISTINCT businessID) >= 2 in the query and get rid of your need for the subquery.)
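For reference, the DISTINCT variant of the main aggregate query would look like this; whether you want it depends on whether a business can have more than one row inside a single month (this is a sketch, not a drop-in replacement for the join):

```sql
-- Counts each business once per type, even if it has several rows in July.
SELECT COUNT(DISTINCT t.`businessID`) AS bizcount,
       SUM(t.reviewcount)             AS reviewcount,
       t.`type`
FROM `biz_listing` t
WHERE updated_date BETWEEN '2014/07/01' AND LAST_DAY('2014/07/01')
GROUP BY t.`type`;
```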
This query works similarly to the previous one. It scans an index over the updated_date range, then by type, then picks up values of businessID and reviewcount. So a compound index in this order will allow this query to be satisfied by a pure index scan, which will be fast:
(updated_date, type, businessID, reviewcount)
Notice that any query that can be satisfied from the (updated_date, type) index can also be satisfied from this one, so you don't need them both.
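Adding the wider index might look like this (again, the index name is illustrative):

```sql
-- Covers the WHERE range, the GROUP BY, and both selected columns,
-- so the query never has to touch the table rows themselves.
ALTER TABLE `biz_listing`
  ADD INDEX `date_type_biz_rev`
    (`updated_date`, `type`, `businessID`, `reviewcount`);
```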
Go read about compound covering indexes, tight range scans, and loose range scans.
Your other query will probably be greatly improved by this same index. Give it a try.
You have a backup table, it seems. You can experiment with various compound indexes on that table until you get good results.
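One way to run that experiment, assuming the backup table is called biz_listing_backup (a hypothetical name), is to add a candidate index and then check the EXPLAIN output, looking for "Using index" in the Extra column:

```sql
-- Add a candidate covering index on the backup table.
ALTER TABLE `biz_listing_backup`
  ADD INDEX `date_type_biz_rev`
    (`updated_date`, `type`, `businessID`, `reviewcount`);

-- Then see whether the reporting subquery can be satisfied from it.
EXPLAIN
SELECT `businessID`, COUNT(*) c
FROM `biz_listing_backup`
WHERE updated_date BETWEEN '2014-06-01' AND LAST_DAY('2014-07-01')
GROUP BY `businessID`, `type`
HAVING c = 2;
```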
I'm reluctant to give this sort of advice:
TL;DR: change your indexes from this to that
because then you may just come back to SO with the next question and be tempted to become a support leech. Can I avoid being a "leech" when I am a beginner in a topic and only ask questions?
You know... teach a person to fish, etc.
Upvotes: 1