Mr J
Mr J

Reputation: 2973

How can I make this basic MySQL SELECT query faster on 1 billion rows?

I have a large 100GB dataset of website content that's around 1 billion rows that I've indexed into a single MySQL 8 table.

CREATE TABLE `inner_pages_content` (
    `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
    `page_id` INT(10) UNSIGNED NULL DEFAULT NULL,
    `page_word` VARCHAR(50) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `word_source` TINYINT(3) UNSIGNED NULL DEFAULT NULL,
    `word_frequency` TINYINT(3) UNSIGNED NULL DEFAULT NULL,
    PRIMARY KEY (`id`) USING BTREE,
    INDEX `FK_inner_pages_content_inner_pages` (`page_id`) USING BTREE,
    INDEX `page_word` (`page_word`) USING BTREE,
    CONSTRAINT `FK_inner_pages_content_inner_pages` FOREIGN KEY (`page_id`) REFERENCES `rocket_test`.`inner_pages` (`id`) ON UPDATE CASCADE ON DELETE CASCADE
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB
AUTO_INCREMENT=1629374935

I'm then running queries similar to this against it to build up relevancy scores about specific pages against specific keywords.

SELECT page_id, word_source, word_frequency FROM inner_pages_content WHERE page_word IN 
('docker', 'software development', 'linux', 'microsoft', 'computer software');

Everything is working well apart from it being painfully slow. One solution I've found is to move everything into GCP's BigQuery but before I pursue that route permanently and write off CloudSQL/MySQL I'd like to see if there is anything I can do to keep my solution using CloudSQL/MySQL.

Both locally on my SSD and on CloudSQL with 500GB SSD these queries are taking 120-200 seconds to complete.

I believe one issue (strong chance I'm wrong!) is that I'm not putting the entire table in memory locally or in CloudSQL due to the costs of 100GB of memory so it's always pulling the results from the disk.

Interestingly when I COUNT(*) the matching results rather than SELECT it's very quick. Typically under 0.5 seconds.

Is there anything I can do to make this query quicker?

Additional information

enter image description here

Upvotes: 2

Views: 572

Answers (1)

Mr_Thorynque
Mr_Thorynque

Reputation: 2012

You don't have a index / query issue (because of the count(*) quick time. It's IO / network issue, so you can try to add selected fields into the index to put data in memory :

https://mariadb.com/kb/en/building-the-best-index-for-a-given-select/

part : "Covering" indexes

create index page_word on inner_pages_content (page_word , page_id, word_source, word_frequency)

Upvotes: 1

Related Questions