Heru S

Reputation: 1323

Better Way of Storing Old Data for Faster Access

The application we are developing writes around 4-5 million rows of data every day, and we need to keep this data for the past 90 days.

The table user_data has the following structure (simplified):

id INT PRIMARY KEY AUTO_INCREMENT
dt TIMESTAMP DEFAULT CURRENT_TIMESTAMP
user_id VARCHAR(20)
data VARCHAR(20)

About the application:

At the moment, we are using MySQL InnoDB to store the latest data (i.e. 7 days and newer) and it is working fine and fits in the innodb_buffer_pool.

As for the older data, we created smaller tables in the form of user_data_YYYYMMDD. After a while, we figured that these tables cannot fit into the innodb_buffer_pool and it started to slow down.

We think that sharding based on both dates and user_ids (i.e. using smaller data sets per user and date, such as user_data_[YYYYMMDD]_[USER_ID]) would be better. This would keep each table much smaller (only around 10K rows at most).

After researching around, we have found that there are a few options out there:

The biggest con I see in this is that we will have a huge number of tables/collections/files when we do this (i.e. 13,000 x 90 = 1,170,000). I wonder whether we are approaching this the right way in terms of future scalability, or if there are other standardized solutions for this.

Upvotes: 2

Views: 201

Answers (4)

Joshua Huber

Reputation: 3533

1 million+ tables sounds like a bad idea. Sharding via dynamic table naming in the app code at runtime has also not been a favorable pattern for me.

My first go-to for this type of problem would be partitioning. You probably don't want 400M+ rows in a single unpartitioned table. In MySQL 5.7 you can even subpartition (but that gets more complex). I would first range partition on your date field, with one partition per day, and index user_id.

If you are on 5.7 and want to dabble with subpartitioning, I would suggest range partitioning by date, then hash subpartitioning by user_id. As a starting point, try 16 to 32 hash buckets. Still index the user_id field.

EDIT: Here's something to play with:

CREATE TABLE user_data (
    id INT AUTO_INCREMENT
  , dt TIMESTAMP DEFAULT CURRENT_TIMESTAMP
  , user_id VARCHAR(20)
  , data VARCHAR(20)
  , PRIMARY KEY (id, user_id, dt)
  , KEY (user_id, dt)
) PARTITION BY RANGE (UNIX_TIMESTAMP(dt))
  SUBPARTITION BY KEY (user_id)
  SUBPARTITIONS 16 (
    PARTITION p1 VALUES LESS THAN (UNIX_TIMESTAMP('2016-10-25')),
    PARTITION p2 VALUES LESS THAN (UNIX_TIMESTAMP('2016-10-26')),
    PARTITION p3 VALUES LESS THAN (UNIX_TIMESTAMP('2016-10-27')),
    PARTITION p4 VALUES LESS THAN (UNIX_TIMESTAMP('2016-10-28')),
    PARTITION pMax VALUES LESS THAN MAXVALUE
);

-- View the metadata if you're interested
SELECT * FROM information_schema.partitions WHERE table_name='user_data';
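As a sanity check on the table above, you can ask MySQL which partitions a query actually touches (EXPLAIN PARTITIONS on 5.6; on 5.7 plain EXPLAIN shows a partitions column). The user id here is made up for illustration:

```sql
-- With the sample partitions above, a one-day range on dt
-- should be pruned to only the p3 subpartitions
-- (rows with dt in [2016-10-26, 2016-10-27)), not the whole table:
EXPLAIN PARTITIONS
SELECT data
FROM user_data
WHERE user_id = 'u123'
  AND dt >= '2016-10-26'
  AND dt <  '2016-10-26' + INTERVAL 1 DAY;
```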

Upvotes: 0

Sergiy Tytarenko

Reputation: 482

Try the TokuDB engine: https://www.percona.com/software/mysql-database/percona-tokudb

Archival data is a great fit for TokuDB. You will need about six times less disk space to store, and memory to process, your dataset compared to InnoDB, or about 2-3 times less than archived MyISAM.

Upvotes: 0

Rick James

Reputation: 142296

This is a big table, but not unmanageable.

If user_id + dt is UNIQUE, make it the PRIMARY KEY and get rid of id, thereby saving space. (More in a minute...)

Normalize user_id to a SMALLINT UNSIGNED (2 bytes) or, to be safer, MEDIUMINT UNSIGNED (3 bytes). This will save a significant amount of space.
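A sketch of one common way to do that normalization (the users table and column names here are made up for illustration): keep a lookup table mapping the external string id to a compact integer, and store only the integer in user_data:

```sql
CREATE TABLE users (
    user_num MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id  VARCHAR(20) NOT NULL,
    PRIMARY KEY (user_num),
    UNIQUE KEY (user_id)
) ENGINE=InnoDB;

-- Insert-or-lookup in one statement when ingesting rows;
-- LAST_INSERT_ID(expr) makes the existing id available either way:
INSERT INTO users (user_id) VALUES ('u123')
    ON DUPLICATE KEY UPDATE user_num = LAST_INSERT_ID(user_num);
SELECT LAST_INSERT_ID();  -- the numeric id to store in user_data
```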

Saving space is important for speed (I/O) for big tables.

PARTITION BY RANGE (TO_DAYS(dt))

with 92 partitions -- the 90 you need, plus one waiting to be DROPped and one being filled. (Note: MySQL only permits UNIX_TIMESTAMP() as a partitioning function on TIMESTAMP columns, so either change dt to DATETIME or use RANGE(UNIX_TIMESTAMP(dt)).) See details here.
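The daily rotation could look roughly like this (partition names are illustrative; in practice you would generate them from the date, and this assumes dt is a DATETIME partitioned by TO_DAYS):

```sql
-- Drop the oldest day once it ages past 90 days:
ALTER TABLE user_data DROP PARTITION p20161025;

-- Carve tomorrow's partition out of the catch-all pMax,
-- so pMax stays empty and the split is cheap:
ALTER TABLE user_data REORGANIZE PARTITION pMax INTO (
    PARTITION p20170124 VALUES LESS THAN (TO_DAYS('2017-01-25')),
    PARTITION pMax      VALUES LESS THAN MAXVALUE
);
```

DROP PARTITION removes a whole day of rows almost instantly, which is the main reason to partition by day instead of DELETEing old rows.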

ENGINE=InnoDB

to get the PRIMARY KEY clustered.

PRIMARY KEY(user_id, dt)

If this is "unique", then it allows efficient access for any time range for a single user. Note: you can remove the "just a day" restriction. However, you must formulate the query without hiding dt in a function. I recommend:

WHERE user_id = ?
  AND dt >= ?
  AND dt  < ? + INTERVAL 1 DAY

Furthermore,

PRIMARY KEY(user_id, dt, id),
INDEX(id)

Would also be efficient even if (user_id, dt) is not unique. The addition of id to the PK is to make it unique; the addition of INDEX(id) is to keep AUTO_INCREMENT happy. (No, UNIQUE(id) is not required.)
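Putting these suggestions together, the table might look roughly like this (a sketch, assuming user_id has been normalized to a MEDIUMINT, dt made a DATETIME so TO_DAYS() is allowed as the partition function, and only a few of the 92 daily partitions shown):

```sql
CREATE TABLE user_data (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    dt      DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    user_id MEDIUMINT UNSIGNED NOT NULL,
    data    VARCHAR(20),
    PRIMARY KEY (user_id, dt, id),  -- clustered; id makes it unique
    INDEX (id)                      -- keeps AUTO_INCREMENT happy
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(dt)) (
    PARTITION p20161025 VALUES LESS THAN (TO_DAYS('2016-10-26')),
    PARTITION p20161026 VALUES LESS THAN (TO_DAYS('2016-10-27')),
    PARTITION pMax      VALUES LESS THAN MAXVALUE
);
```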

INT --> BIGINT UNSIGNED ??

INT (which is SIGNED) will top out at about 2 billion. At 4-5 million rows per day, that will happen in well under two years. Is that OK? If not, you may need BIGINT (8 bytes vs 4).

This partitioning design does not care about your 7-day rule. You may choose to keep the rule and enforce it in your app.

BY HASH

will not work as well.

SUBPARTITION

is generally useless.

Are there other queries? If so they must be taken into consideration at the same time.

Sharding by user_id would be useful if the traffic were too much for a single server. MySQL, itself, does not (yet) have a sharding solution.

Upvotes: 0

user1881277

Reputation:

Scaling a database is a problem unique to each application. Most of the time someone else's approach cannot be reused, since almost every application writes its data in its own way. So you have to figure out how you are going to manage your data.

Having said that, if your data continues to grow, the best solution is sharding, where you distribute the data across different servers. As long as you are bound to a single server (e.g. just creating different tables), you will keep hitting resource limits such as memory, storage, and processing power, which cannot be increased without bound.

How to distribute the data is something you have to figure out from your business use cases. As you mentioned, if you are not getting many requests for old data, the best way is to distribute by date: a DB for 2016 data, a DB for 2015, and so on. Later you can purge or shut down the servers holding the oldest data.

Upvotes: 1
