cj333

Reputation: 2609

How can I segment a MySQL fulltext table?

I have a MySQL table with a FULLTEXT index. Its size grows by about 30K rows every day. I am afraid query speed will decrease as the data grows to many GB. My situation permits me to limit searches to the most recent one million rows.

First I tried partitioning the table, but MySQL does not support FULLTEXT indexes and partitioning at the same time.

http://dev.mysql.com/doc/refman/5.5/en/partitioning-limitations.html

Some friends suggest I use another product, such as Sphinx, Solr, or Oracle. But I'd rather stick with MySQL, for budget reasons. (Forgive my stubbornness.)

So, how can I optimize MySQL fulltext table searches? How can I keep an acceptable query speed as the volume of my data scales up?

  1. Divide the table? (But that makes queries complex.)

  2. Only limit the query by id?

     SELECT *
       FROM table
      WHERE MATCH (some_field) AGAINST ('+search_words' IN BOOLEAN MODE)
        AND id > last_id - 1000000
        AND id < last_id
      ORDER BY date DESC

  3. Another good suggestion? Thanks.

Upvotes: 2

Views: 987

Answers (1)

O. Jones

Reputation: 108796

Here's a way to handle your problem. It's not super-beautiful but it will be quite efficient.

Create two copies of your table. Call one something like text_current and the second something like text_archive.
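A minimal sketch of what those two tables might look like. The column names (record_date, some_field) are assumptions based on the question, and the MyISAM engine is chosen because in MySQL 5.5 FULLTEXT indexes are only supported on MyISAM tables:

```sql
-- Hypothetical schema: column names are assumptions, not from the question's actual table.
-- In MySQL 5.5, FULLTEXT indexes require the MyISAM storage engine.
CREATE TABLE text_current (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    record_date INT UNSIGNED NOT NULL,   -- unix timestamp of insertion
    some_field  TEXT,
    KEY idx_record_date (record_date),
    FULLTEXT KEY ft_some_field (some_field)
) ENGINE=MyISAM;

-- The archive gets the identical structure, indexes included.
CREATE TABLE text_archive LIKE text_current;
```

The index on record_date matters: both the nightly copy and the weekly purge select rows by date range, so without it each job would scan the whole table.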

Load your newly arriving rows of data into text_current. That way your current data will be available for search.

Write two stored procedures: one called something like copy_text that selects older rows from text_current and inserts them into text_archive, and another called something like purge_text that deletes the older rows from text_current. Then run those two stored procedures from MySQL scheduled events.
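Here is one way those procedures and their schedule might look. This is a sketch, not a drop-in solution: it assumes the hypothetical record_date unix-timestamp column, MySQL 5.1 or later, and that the event scheduler is turned on:

```sql
DELIMITER $$

-- Copy yesterday's rows (midnight to midnight) into the archive.
CREATE PROCEDURE copy_text()
BEGIN
    INSERT INTO text_archive
    SELECT *
      FROM text_current
     WHERE record_date >= UNIX_TIMESTAMP(DATE_SUB(CURDATE(), INTERVAL 1 DAY))
       AND record_date <  UNIX_TIMESTAMP(CURDATE());
END$$

-- Delete everything older than 35 days from the current table.
CREATE PROCEDURE purge_text()
BEGIN
    DELETE FROM text_current
     WHERE record_date < UNIX_TIMESTAMP(DATE_SUB(CURDATE(), INTERVAL 35 DAY));
END$$

DELIMITER ;

-- Scheduling requires the event scheduler: SET GLOBAL event_scheduler = ON;
CREATE EVENT ev_copy_text
    ON SCHEDULE EVERY 1 DAY
    STARTS CURRENT_DATE + INTERVAL 1 DAY + INTERVAL 3 HOUR   -- 03:00 nightly
    DO CALL copy_text();

CREATE EVENT ev_purge_text
    ON SCHEDULE EVERY 1 WEEK
    STARTS CURRENT_DATE + INTERVAL 1 DAY + INTERVAL 4 HOUR   -- weekly, after the copy
    DO CALL purge_text();
```

Running the purge an hour after the copy ensures rows are always archived before they become eligible for deletion.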

If I were you I would run the copy_text job once every night at 03:00, and copy every item that arrived any time in the previous day (midnight to midnight).

I would run the purge_text job once a week, and purge everything from text_current that is more than 35 days old. (At 30K records a day, it takes about 33.3 days to accumulate a million records.)

This method of migrating your data from a current table to an archive table has some handy features.

  1. It's safe -- most records are in both the current and archived tables.
  2. It's deterministic -- using dates rather than ids lets you select items for archiving precisely.
  3. It offers a slow but effective "search the archives" feature for your product in the future, if you ever need it.
  4. It keeps your current data table at a workable size while letting you use MySQL's fine full text search feature.
  5. It is efficient -- the hard work of migrating is done in one go each day, rather than continuously by mucking about with id numbers in your queries.

Here's how to select records by date, when the record_date column is a unix time stamp. If you run this on any given day, it will select records from midnight on the previous day, up to but not including records at midnight on the present day.

...
WHERE text_current.record_date >= UNIX_TIMESTAMP(DATE_SUB(CURDATE(), INTERVAL 1 DAY))
  AND text_current.record_date < UNIX_TIMESTAMP(CURDATE())
...

It doesn't matter at what time of day you run this query; it always covers midnight to midnight. That's the beauty of CURDATE() -- it means midnight today. (NOW() means, well, now; it's like CURDATE() but also carries a time of day. You don't want that, because you want to do your archiving midnight-to-midnight. If you use NOW() you'll get caught by small unpredictabilities in when your stored procedures actually run.)

(Don't be tempted to use the BETWEEN operator for time ranges; it's inclusive and you want the end time of your time range not to be inclusive.)

Similarly, if you want to delete everything that's older than 35 days, do this.

DELETE 
 FROM text_current
 WHERE text_current.record_date < UNIX_TIMESTAMP(DATE_SUB(CURDATE(), INTERVAL 35 DAY))

See how this goes? When you do your archiving and purging, you're always selecting records based on the DAY they were inserted, regardless of the TIME they were inserted. That's what I mean by deterministic.

Upvotes: 5
