Silver Light

Reputation: 45912

Approaches to gathering high-volume visit statistics

I have a website where users can post their articles, and I would like to give each article's author full stats about its visits and referrers. The implementation seems quite straightforward: just store a database record for every visit and then use aggregate functions to draw graphs and so on.

The problem is that articles receive about 300k views in 24 hours, so in just a month the stats table will accumulate about 9 million records. That is a very large number, because my server isn't particularly powerful.

Is there a solution to this kind of task? Is there an algorithm or caching mechanism that allows storing long-term statistics without losing accuracy?

P.S. Here is my original stats table:

visitid INT
articleid INT
ip INT
datetime DATETIME

Upvotes: 3

Views: 307

Answers (2)

mjv

Reputation: 75125

Assuming a home-brewed usage-tracking solution (as opposed to, say, Google Analytics as suggested in the other response), a two-database setup may be what you are looking for:

  • a "realtime" database which captures the vist events as they come.
  • an "offline" database where the data from the "realtime" database is collected on a regular basis, for being [optionally] aggregated and indexed.

The purpose of this setup is mostly driven by operational concerns. The "realtime" database is not indexed (or only minimally indexed) for fast insertion, and it is regularly emptied, typically each night when the traffic is lighter, as the "offline" database picks up the events collected through the day.
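As a rough illustration, the nightly hand-off could look like the following minimal sketch, assuming MySQL and the hypothetical table names visits_realtime and visits_offline (both with the schema from the question):

-- Move completed days from the lightly indexed realtime table to the
-- offline table, then clear them out. Run nightly, e.g. from cron.
INSERT INTO visits_offline (visitid, articleid, ip, `datetime`)
SELECT visitid, articleid, ip, `datetime`
FROM visits_realtime
WHERE `datetime` < CURDATE();

DELETE FROM visits_realtime
WHERE `datetime` < CURDATE();

Wrapping both statements in a transaction (or renaming the realtime table and swapping in a fresh, empty one) avoids losing events that arrive between the copy and the delete.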

Both databases can share the very same schema, or the "offline" database may introduce various forms of aggregation. The specific aggregation applied to the offline database can vary greatly, depending on the desire to keep the database's size in check and on which data is deemed important (most statistics/aggregation functions introduce some information loss, and one needs to decide which losses are acceptable and which are not).
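For example, a hypothetical daily rollup could replace raw rows with per-article, per-day counts; the exact columns depend on which statistics you decide you cannot afford to lose:

-- Hypothetical per-day aggregate: one row per article per day.
CREATE TABLE article_daily_stats (
    articleid  INT,
    day        DATE,
    hits       INT,
    unique_ips INT,
    PRIMARY KEY (articleid, day)
);

INSERT INTO article_daily_stats (articleid, day, hits, unique_ips)
SELECT articleid, DATE(`datetime`), COUNT(*), COUNT(DISTINCT ip)
FROM visits_offline
GROUP BY articleid, DATE(`datetime`);

Note that COUNT(DISTINCT ip) is one of the lossy choices mentioned above: once the raw rows are gone, daily unique counts can no longer be combined into exact weekly or monthly uniques.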

Because of the "half life" nature of the value of usage logs, whereby the relative value of details decays with time, a common strategy is to aggregate info in multiple tiers, whereby the data collected in the last, say, X days remains mostly untouched, the data collected between X and Y days is partially aggregated, and finally, data older than Y days only keep the most salient info (say, number of hits).

Upvotes: 6

UtopiaLtd

Reputation: 2590

Unless you're particularly keen on storing your statistical data yourself, you might consider using Google Analytics or one of its modern counterparts, which are much better than the old remotely hosted hit counters of the 90s. You can find a PHP interface to the Google Analytics API at http://code.google.com/p/gapi-google-analytics-php-interface/

Upvotes: 1
