Reputation: 1278
I'm building, basically, an ad server. This is a personal project that I'm trying to impress my boss with, and I'd love any form of feedback about my design. I've already implemented most of what I describe below, but it's never too late to refactor :)
This is a service that delivers banner ads (http://myserver.com/banner.jpg links to http://myserver.com/clicked) and provides reporting on subsets of the data.
For every ad impression served and every click, I need to record a row with (id, value), where value is the cash value of the transaction (e.g. -$.001 per served banner ad at $1 CPM, or +$.25 for a click). My output is all based on earnings per impression (abbreviated EPC): SUM(value)/COUNT(impressions), computed over subsets of the data, like "earnings per impression where browser == 'Firefox'". The goal is to output something like "Your overall EPC is $.50, but where browser == 'Firefox', your EPC is $1.00", so that the end user can quickly see the significant factors in their data.
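For concreteness, here's a minimal sketch of the kind of query this implies, assuming a flat hits table with hypothetical browser and is_impression columns (is_impression being 1 for an impression, 0 for a click):

```sql
-- Hypothetical flat layout: hits(id, value, browser, is_impression, at).
-- Overall EPC across all hits.
SELECT SUM(value) / SUM(is_impression) AS overall_epc
FROM hits;

-- EPC restricted to one subset.
SELECT SUM(value) / SUM(is_impression) AS firefox_epc
FROM hits
WHERE browser = 'Firefox';
```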
Because there's a very large number of these subsets (tens of thousands), and reporting output only needs to include the summary data, I'm precomputing the EPC per subset with a background cron task and storing these summary values in the database. Once every 2-3 hits, a Hit needs to query the Hits table for other recent Hits by a Visitor (e.g. "find the REFERER of the last Hit"), but usually each Hit only performs an INSERT, so to keep response times down, I've split the app across 3 servers [bgprocess, mysql, hitserver].
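A sketch of what that precompute step could look like, assuming a summary table keyed by subset (all table and column names here are hypothetical):

```sql
-- One row per precomputed subset; the report reads only this table.
CREATE TABLE subset_epc (
    subset_key  VARCHAR(64)    NOT NULL,  -- e.g. 'browser=Firefox'
    total_value DECIMAL(12, 4) NOT NULL,
    impressions INT UNSIGNED   NOT NULL,
    computed_at DATETIME       NOT NULL,
    PRIMARY KEY (subset_key)
) ENGINE=InnoDB;

-- Cron task: recompute a whole family of subsets in one pass.
REPLACE INTO subset_epc (subset_key, total_value, impressions, computed_at)
SELECT CONCAT('browser=', browser), SUM(value), SUM(is_impression), NOW()
FROM hits
GROUP BY browser;
```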
Right now, I've structured all of this as 3 normalized tables: Hits, Events and Visitors. Visitors are unique per visitor session, a Hit is recorded every time a Visitor loads a banner or makes a click, and Events capture the distinct many-to-many relationship between Visitors and Banners (e.g. an example Event is "Visitor X at Banner Y", which is unique but may have multiple Hits). The reason I'm keeping all the hit data in the same table is that, while my example above only describes "banner impressions -> clickthroughs", we're also tracking "clickthroughs -> pixel fires", "pixel fires -> second clickthrough" and "second clickthrough -> sale page pixel".
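To make the discussion concrete, here's my reading of that layout as DDL. All column names are guesses; a real Hits table would also carry a hit-type column for the pixel-fire and second-clickthrough variants (I've used a single is_impression flag here to match the EPC formula above):

```sql
CREATE TABLE visitors (
    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    session_id CHAR(32)     NOT NULL,     -- unique per visitor session
    browser    VARCHAR(32),
    UNIQUE KEY (session_id)
) ENGINE=InnoDB;

CREATE TABLE events (
    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    visitor_id INT UNSIGNED NOT NULL,
    banner_id  INT UNSIGNED NOT NULL,
    UNIQUE KEY (visitor_id, banner_id),   -- "Visitor X at Banner Y" is unique
    FOREIGN KEY (visitor_id) REFERENCES visitors (id)
) ENGINE=InnoDB;

CREATE TABLE hits (
    id            BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    event_id      INT UNSIGNED  NOT NULL,
    is_impression TINYINT(1)    NOT NULL, -- 1 = impression, 0 = click etc.
    value         DECIMAL(8, 4) NOT NULL,
    at            DATETIME      NOT NULL,
    KEY (event_id),
    KEY (at),                             -- for "recent hits" lookups
    FOREIGN KEY (event_id) REFERENCES events (id)
) ENGINE=InnoDB;
```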
My problem is that the Hits table is filling up quickly, and slowing down ~linearly with size. My test data only has a few thousand clicks, but already my background processing is slowing down. I can throw more servers at it, but before launching the alpha of this, I want to make sure my logic is sound.
So I'm asking you SO gurus: how would you structure this data? Am I crazy to try to precompute all these tables? Since we rarely need to access Hit records older than one hour, would I benefit from splitting the Hits table into ProcessedHits (with all historical data) and UnprocessedHits (with ~the last hour's data), or does having the Hit.at Date column indexed make this superfluous?
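To show what I mean about the index: with Hit.at indexed, the background job can touch only the recent window, and EXPLAIN would tell me whether the index is actually being used (sketch against the hypothetical schema above):

```sql
-- Aggregate only the last hour's rows; a range scan on the `at` index
-- should keep this cheap regardless of how large the full table grows.
EXPLAIN
SELECT event_id, SUM(value) AS total_value, SUM(is_impression) AS impressions
FROM hits
WHERE at >= NOW() - INTERVAL 1 HOUR
GROUP BY event_id;
```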
This probably needs some elaboration; sorry if I'm not clear, I've been working on it for the past ~3 weeks straight :) TIA for all input!
Upvotes: 3
Views: 323
Reputation: 18654
You should be able to build an app like this in a way that won't slow down linearly with the number of hits.
From what you said, it sounds like you have two main potential performance bottlenecks. The first is inserts. If you can have your inserts happen at the end of the table, that will minimize fragmentation and maximize throughput. If they're in the middle of the table, performance will suffer as fragmentation increases.
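In MySQL terms, the usual way to get end-of-table inserts is a monotonically increasing clustered key. A self-contained sketch, assuming InnoDB:

```sql
-- InnoDB clusters rows on the primary key, so an AUTO_INCREMENT id means
-- every INSERT appends at the end of the table. A random key (e.g. a UUID)
-- as the PK would scatter inserts through the middle and fragment pages.
CREATE TABLE hits (
    id    BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    value DECIMAL(8, 4) NOT NULL,
    at    DATETIME      NOT NULL,
    KEY (at)   -- secondary index for time-windowed reads
) ENGINE=InnoDB;
```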
The second area is the aggregations. Whenever you do a significant aggregation, be careful that you don't cause all in-memory buffers to get purged to make room for the incoming data. Try to minimize how often the aggregations have to be done, and be smart about how you group and count things, to minimize disk head movement (or maybe consider using SSDs).
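One way to keep each aggregation pass small is to roll raw rows up into coarse time buckets and only recompute the recent ones; a sketch with hypothetical names:

```sql
-- Later reports read this small table instead of rescanning raw hits,
-- which also avoids flooding the buffer pool with cold data.
CREATE TABLE hits_hourly (
    hour_start  DATETIME       NOT NULL,
    total_value DECIMAL(12, 4) NOT NULL,
    impressions INT UNSIGNED   NOT NULL,
    PRIMARY KEY (hour_start)
) ENGINE=InnoDB;

-- Recompute only the last couple of buckets each run; REPLACE keeps the
-- job idempotent if runs overlap.
REPLACE INTO hits_hourly (hour_start, total_value, impressions)
SELECT DATE_FORMAT(at, '%Y-%m-%d %H:00:00') AS hour_start,
       SUM(value),
       SUM(is_impression)
FROM hits
WHERE at >= NOW() - INTERVAL 2 HOUR
GROUP BY hour_start;
```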
You might also be able to do some of the accumulations at the web tier based entirely on the incoming data rather than on new queries, perhaps with a fallback of some kind if the server goes down before the collected data is written to the DB.
Are you using InnoDB or MyISAM?
Upvotes: 1
Reputation: 8336
Generally you have detailed "accumulator" tables where records are written in real time. As you've discovered, they get large quickly. Your backend usually summarizes these raw records into cubes or other "buckets" from which you then write reports. Your cubes will probably define themselves once you map out what you're trying to report on and/or bill for.
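As a minimal sketch of that accumulator-to-cube flow, using the tables from the question (all names hypothetical):

```sql
-- A small cube keyed by the dimensions you report/bill on; reports read
-- this instead of the raw accumulator table.
CREATE TABLE report_cube (
    day         DATE           NOT NULL,
    banner_id   INT UNSIGNED   NOT NULL,
    browser     VARCHAR(32)    NOT NULL,
    total_value DECIMAL(12, 4) NOT NULL,
    impressions INT UNSIGNED   NOT NULL,
    PRIMARY KEY (day, banner_id, browser)
) ENGINE=InnoDB;

-- Backend pass: collapse today's raw rows into the cube.
REPLACE INTO report_cube
SELECT DATE(h.at), e.banner_id, v.browser, SUM(h.value), SUM(h.is_impression)
FROM hits h
JOIN events e   ON e.id = h.event_id
JOIN visitors v ON v.id = e.visitor_id
WHERE h.at >= CURDATE()
GROUP BY 1, 2, 3;
```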
Don't forget fraud detection if this is a real project.
Upvotes: 0