user_19

Reputation: 599

mysql create index on table with 100 million rows

I have a few MySQL tables with around 300 columns and 100 million rows each. They store data from log files, hence the size. I am using the InnoDB engine. Some queries involving joins of these tables obviously do not work. I tried adding indexes to them, but those queries never finish at all.

I wanted to know if there is any other way to speed up performance, or some way to make the 'create index' work on the tables?

Thank you.

Upvotes: 8

Views: 16927

Answers (3)

WilliamK

Reputation: 1772

The Bug

On MySQL tables using the MyISAM engine, there are some problems with creating a new secondary index.

A known issue with the MyISAM engine on some MySQL versions, such as 5.7.24 (shipped with Wamp, for instance), means that creating an index not only causes a table scan, as expected, but also requires a full table rebuild. If you just drop an index, the table is also rebuilt :-(

Ref: https://bugs.mysql.com/bug.php?id=93530

Alternative

Sometimes you cannot upgrade MySQL, or cannot ask your customers to do that in order to run your solution. Changing the engine to InnoDB can lead to other problems if you don't need all the features InnoDB provides.

The Index Table

So there is an approach that consists of creating an "index table" manually, with the benefit that you can restrict it to only the records you really need, as I explain below:

Imagine you have 100M records of companies of the world in a table, where about 30M are companies from the USA and 10M from Canada, plus companies from other countries.

Each company has a COUNTRY and a STATE field that you want to index, because you need to search US or Canadian companies by state.

So in MySQL, if you create an index on country and state, all 100M records will be indexed, even those with NULL states.
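For comparison, that full index on the original table would look something like this (using the table and column names from this example):

create index idx_companies_country_state
  on original_company_table (company_country, company_state);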

To solve this you create an index-table and a real index, like this:

create table index_tb_companies (
   company_id int unique,
   company_country char(2),  -- US/CA
   company_state char(2)     -- AL/AK/.../WI/WY
);

create index index_tb_companies_index 
  on index_tb_companies (company_country, company_state);

Fill the Index Table

Now you can import the original data into the index table with a simple INSERT INTO or REPLACE INTO using a filtered SELECT.

replace into index_tb_companies(
  company_id, company_country, company_state)
(select 
    company_id, company_country, company_state 
from original_company_table
    where company_country in ('US', 'CA')
);  

This will take a while, since you probably don't have an index on country yet and a full table scan is needed. But the final index-table size will be smaller than a regular MySQL index, since only the US/CA data will be in there.

How to Select

Now, the final part is to make use of the index table in your specific reports on US and CA companies, since other countries are not covered by the index.

select  o.*
from 
    original_company_table o INNER JOIN
    index_tb_companies idx ON idx.company_id = o.company_id
where
    idx.company_country = 'US' 
    and idx.company_state = 'NY';

This approach is particularly good when you only want to index a tiny portion of your data in MySQL, so the index size stays small.

Partial Index

Other databases, like PostgreSQL, have "partial indexes": you can create regular indexes and pass a WHERE clause at creation time.

PG Partial Indexes: https://www.postgresql.org/docs/8.0/indexes-partial.html
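As an illustration, a partial index on the same example table in PostgreSQL could be written roughly like this (MySQL has no equivalent syntax):

create index idx_companies_us_ca_state
    on original_company_table (company_state)
    where company_country in ('US', 'CA');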

Upvotes: 0

Bill Karwin

Reputation: 562498

Creating an index takes time, proportional to the number of rows in the table. 100 million rows is quite a lot for a MySQL table. It will probably take many hours to create an index on that table. Exactly how long varies, based on other factors including your server hardware, the data type of the columns you are creating the index for, other current load on the database, etc.

One tool that can help you is pt-online-schema-change. It actually takes longer to build the index, but you can continue to read and write the original table while it's working. Test with a smaller table so you get some experience with using this tool.
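For reference, a typical invocation to add an index looks roughly like this (the database, table, column, and index names are only placeholders; check the Percona documentation for the options available in your version):

pt-online-schema-change \
  --alter "ADD INDEX idx_event_time (event_time)" \
  D=mydb,t=my_log_table \
  --execute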

You can view a webinar about this tool here: Zero-Downtime Schema Changes in MySQL (free to view, but requires registration).

Another technique is to create an empty table like your original, create the index on that empty table, and then gradually copy data from the original table into the new one. If this is a log table, you probably write to it more than you read from it, so you can swap the tables right away, start logging new events to the new table immediately, and backfill the old data over time.
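A minimal sketch of that table swap, assuming the log table is called logs and the new index goes on an event_time column (both names are placeholders):

create table logs_new like logs;
alter table logs_new add index idx_event_time (event_time);

-- atomic swap: new events are written to the indexed table,
-- the old data stays behind in logs_old for backfilling
rename table logs to logs_old, logs_new to logs;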

A tool like pt-archiver can help you to copy data gradually without putting too much load on the server. Simply doing INSERT INTO... SELECT is not good for your database server's health if you try to copy 100 million rows in one transaction. It also puts locks on the original table. pt-archiver works by copying just a bite-sized chunk of rows at a time, so it avoids the high cost of such a large transaction.
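The chunking idea, sketched by hand in plain SQL (pt-archiver automates this; the table, columns, and id ranges are placeholders):

-- copy the backlog in small primary-key ranges instead of one huge transaction
insert into logs (id, event_time, message)
    select id, event_time, message from logs_old
    where id > 0 and id <= 10000;

insert into logs (id, event_time, message)
    select id, event_time, message from logs_old
    where id > 10000 and id <= 20000;
-- ...and so on, one small chunk per statement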

If you use an auto-increment primary key, take care to adjust the value to be higher than the max value in the original table before you let log events start writing to it, so you don't accidentally assign the same id values more than once.
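For example, still with the placeholder names from above:

-- find the highest id already used by the old table...
select max(id) from logs_old;        -- suppose it returns 123456789

-- ...and start the new table's counter above it (the value must be a literal)
alter table logs auto_increment = 123456790;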

Upvotes: 20

developerwjk

Reputation: 8659

Use

 create table newtable like oldtable;

Then apply the index to the newtable while it is empty.
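For example, if the index you need is on a timestamp column (the column and index names here are just placeholders):

 alter table newtable add index idx_created_at (created_at);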

Then

 insert into newtable select * from oldtable;

This may also take a long time to finish.

Upvotes: 9
