Reputation: 1752
I am looking for a Perl library to handle caching of DB queries, but it needs to handle a much larger cache than the typical application. It needs to:
cache around 200,000 records at once, averaging around 2 MB each (so, a total cache size of around 400GB)
have no maximum record size (or at least a fairly large one, like several GB)
be size-aware, so it automatically deletes the oldest (in terms of last access time) records when the total storage gets above a preset maximum
be as fast as possible given the above requirements
The libraries I've looked at so far are CHI and Cache::SizeAwareFileCache (extension of Cache::Cache).
The main concern I have with CHI is that I would need to use CHI::Driver::File with is_size_aware turned on, but the documentation specifically warns against this:
...for drivers that cannot atomically read and update a value - for example, CHI::Driver::File - there is a race condition in the updating of size that can cause the size to grow inaccurate over time.
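For context, the setup I have in mind would look roughly like this. This is only a sketch based on my reading of the CHI docs; the root_dir, the max_size value, and the load_html_from_db helper are placeholders, not real code from my application:

    use strict;
    use warnings;
    use CHI;

    # Sketch only: root_dir and max_size are placeholders.
    my $cache = CHI->new(
        driver        => 'File',
        root_dir      => '/var/cache/crawl',
        is_size_aware => 1,
        max_size      => 400 * 1024**3,   # ~400GB
    );

    # Read-through lookup for one page.
    sub cached_html {
        my ($url) = @_;
        my $html = $cache->get($url);
        if ( !defined $html ) {
            $html = load_html_from_db($url);   # hypothetical DB lookup
            $cache->set( $url, $html );
        }
        return $html;
    }

It is exactly this is_size_aware/max_size combination on the File driver that the documentation warns about.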
The main concern I have with Cache::SizeAwareFileCache is that Cache::Cache is old and not currently maintained. One of the first things I see in the documentation is a section that advises me to use CHI instead.
Any recommendations? Should I be using either of these two libraries, or something else? Am I crazy for wanting to use caching for this at all? Does anyone have experience with similar requirements? I would be grateful for any advice.
Some details about the application:
I have an application that analyzes large websites to look for hard-to-find errors/inefficiencies in the HTML code, often buried among hundreds of thousands of pages. The application crawls an entire website and stores the HTML code of each page in the DB (a MySQL server running on a separate machine). When the crawl is complete, the user can run various software tools to analyze the HTML of each page on the site.
The tools wait in a queue and run one at a time. Each tool needs to load the HTML of every page in the crawl, always in the same order. So, if the crawl grabbed 100,000 pages and the user needs to run 15 different tools on it, then the cache needs to hold at least 100,000 records, each of which will be read 15 times. It is critical that the cache be able to store all of the pages from a given site at the same time (otherwise every page will get dropped and then re-cached for each tool, which would be worse than no caching at all).
The biggest goal is to reduce the load on the database. The secondary (but still very important) goal is to improve the speed.
Upvotes: 2
Views: 714
Reputation: 4488
Instead of using a module that implements caching on its own, I would suggest using something like Memcached, and then one of the Perl bindings such as Cache::Memcached, CHI::Driver::Memcached, or Memcached::Client.
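A minimal sketch with Cache::Memcached, assuming a pool of memcached servers (the addresses and the load_html_from_db helper are placeholders). Note that memcached's default item size limit is around 1MB, so you would likely need to raise it (e.g. with its -I option) to hold ~2MB pages:

    use strict;
    use warnings;
    use Cache::Memcached;

    # Sketch only: server addresses are placeholders.
    my $memd = Cache::Memcached->new({
        servers => [ '10.0.0.1:11211', '10.0.0.2:11211' ],
    });

    sub cached_html {
        my ($url) = @_;
        my $html = $memd->get($url);
        if ( !defined $html ) {
            $html = load_html_from_db($url);   # hypothetical DB lookup
            $memd->set( $url, $html );
        }
        return $html;
    }

This also lets you spread the 400GB across several machines instead of relying on one local disk cache.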
Upvotes: 1
Reputation: 1808
Perl works better with files than with a DB. If you have 400GB of HTML across 200,000 pages (roughly 2MB per HTML file), why put the 400GB of data in the DB, only to read it back into a cache (and eventually write it to disk again)? Instead, keep the HTML content on disk and store only the file path in the DB record.
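A rough sketch of that approach using DBI. The connection details, the pages table and its columns, and the base directory are all assumptions for illustration:

    use strict;
    use warnings;
    use DBI;
    use Digest::MD5 qw(md5_hex);
    use File::Path  qw(make_path);

    # Sketch only: credentials, host, table and paths are placeholders.
    my ( $db_user, $db_pass ) = ( 'crawler', 'secret' );
    my $dbh = DBI->connect(
        'DBI:mysql:database=crawler;host=db.example.com',
        $db_user, $db_pass, { RaiseError => 1 },
    );

    sub store_page {
        my ( $url, $html ) = @_;
        my $name = md5_hex($url);
        my $dir  = '/data/pages/' . substr( $name, 0, 2 );   # spread files over subdirs
        make_path($dir);
        my $file = "$dir/$name.html";

        open my $fh, '>:raw', $file or die "Cannot write $file: $!";
        print {$fh} $html;
        close $fh;

        # Keep only the URL and the file path in MySQL.
        $dbh->do( 'INSERT INTO pages (url, file_path) VALUES (?, ?)',
            undef, $url, $file );
    }

The tools then read the HTML straight from the filesystem, and the DB stays small.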
As you accumulate more "pages" and "tools" to analyze, you may want more analyzing machines, and you cannot keep a 400GB DB in sync through a cache. Keep the DB small and efficient, and duplicate the files to the local disk of each analyzing machine for direct access (the fastest option). Tools that have no dependencies on each other and update different fields of the DB record can run concurrently; for tools with dependencies, it's up to your workflow design.
Upvotes: 3