MarathonStudios
MarathonStudios

Reputation: 2979

What's the most space-efficient way to store large amounts of text in MySQL?

I'm writing a webcrawler in Python that will store the HTML code of a large set of pages in a MySQL database. I'd like to make sure my methods of storage and processing are optimal before I begin processing data. I would like to:

Does anyone have any suggestions or experience in this or a similar issue? Is Python the optimal language to be doing this in, given that it's going to require a number of HTTP requests and regular expressions plus whatever compression is optimal?

Upvotes: 1

Views: 1006

Answers (2)

barryhunter
barryhunter

Reputation: 21081

Do you actully need to store the source in the database?

Trying to run 'LIKE' queries against the data is going to suck big time anyway.

Store the raw data on the file system, as standard files. Just dont stick them all in one folder. use hashes of the id, to store them in predictable folders.

(while of course it is perfectly possible to store the text in the database, it bloats the size of your database, and makes it harder to work with. backups are (much!) bigger, changing storage engine, becomes more painful etc. Scaling your filesystem, is usually just a case of adding another harddisk. That doesnt work so easily with a database - you start needing to shard)

... to do any sort of searching on the data, you looking at building an index. I only have experience with SphinxSearch, but that allows you to specify a filename in the input database.

Upvotes: 0

Ted Hopp
Ted Hopp

Reputation: 234857

If you don't mind the HTML being opaque to MySQL, you can use the COMPRESS function to store the data and UNCOMPRESS to retrieve it. You won't be able to use the HTML contents in a WHERE clause (using, e.g., LIKE).

Upvotes: 1

Related Questions