Emmanuel

Reputation: 254

Doing analytics on a large table in Rails / PostgreSQL

I have a "Vote" table in my database which is growing in size everyday, currently at around 100 million rows. For internal analytics / insights I used to have a rake task which would compute a few basic metrics, like the number of votes made daily in the past few days. It's just a COUNT with a where clause on the date "created_at".
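
For reference, the query is essentially something like this (the exact rake task isn't shown, so this is an assumed sketch):

Vote.where(created_at: Date.yesterday.beginning_of_day..Date.yesterday.end_of_day).count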

This rake task was doing fine until I deleted the index on "created_at", because it seemed to have a negative impact on app performance for all the other user-facing queries that didn't need it, especially when inserting a new row.

Currently I don't have much insight into what is going on in my app and in this table. However, I don't really want to add indexes to such a large table if they're only for my own use.

What else can I try?

Upvotes: 1

Views: 537

Answers (3)

Substantial

Reputation: 6682

Alternatively, you could sidestep the Vote table altogether and keep an external tally.

Every time a vote is cast, a separate tally class that keeps a running count of votes will be invoked. There will be one tally record per day, holding an integer that represents the number of votes cast on that day.

Each increment call to the tally class will find a tally record for the current date (today), increment the vote count, and save the record. If no record exists, one will be created and incremented accordingly.

For example, let's have a class called VoteTally with two attributes: a date (date) and a vote count (integer); no timestamps, no associations. Here's what the model will look like:

class VoteTally < ActiveRecord::Base

  # Bump today's tally, creating the record on first use.
  def self.tally_up!
    find_or_create_by_date(Date.today).increment!(:votes)
  end

  # Decrement today's tally when a vote is removed.
  def self.tally_down!
    find_or_create_by_date(Date.today).decrement!(:votes)
  end

  # Returns 0 for days with no tally record instead of raising NoMethodError.
  def self.votes_on(date)
    record = find_by_date(date)
    record ? record.votes : 0
  end

end
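
The backing table would look something like this (a minimal migration sketch; the table and column names are assumed from the description above, and the default of 0 keeps increment!/decrement! well-behaved on fresh records):

class CreateVoteTallies < ActiveRecord::Migration
  def change
    create_table :vote_tallies do |t|
      t.date    :date,  null: false
      t.integer :votes, null: false, default: 0
    end
    # One tally row per day.
    add_index :vote_tallies, :date, unique: true
  end
end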

Then, in the Vote model:

class Vote < ActiveRecord::Base
  after_create :tally_up
  after_destroy :tally_down

  # ...

  private

  def tally_up
    VoteTally.tally_up!
  end

  def tally_down
    VoteTally.tally_down!
  end

end

These methods will get vote counts:

VoteTally.votes_on Date.today
VoteTally.votes_on Date.yesterday
VoteTally.votes_on 3.days.ago.to_date
VoteTally.votes_on Date.parse("2013-05-28")

Of course, this is a simple example and you will have to adapt it to suit. This will result in an extra query during vote casting, but it's a hell of a lot faster than a where clause on 100M records with no index. Minor inaccuracies are possible with this solution, but I assume that's acceptable given the anecdotal nature of daily vote counts.

Upvotes: 1

Beryllium

Reputation: 12998

If the index really has an impact on write performance, and it's only a few people who run statistics now and then, you might consider another general approach:

You could separate your "transaction processing database" from your "reporting database".

You could update your reporting database on a regular basis, and create reporting-only indexes only there. What's more, reporting queries will not conflict with transaction-oriented traffic, and it doesn't matter how long they run.

Of course, this introduces a certain delay, and it increases system complexity. On the other hand, if you roll forward your reporting database on a regular basis, you can ensure that your backup scheme actually works.
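
A minimal sketch of the Rails side, assuming a second "reporting" entry in config/database.yml (the connection name and the reporting-only index are assumptions):

class ReportingVote < ActiveRecord::Base
  establish_connection :reporting  # points at the reporting database
  self.table_name = "votes"

  # Backed by a created_at index that exists only in the reporting copy.
  def self.daily_count(date)
    where(created_at: date.beginning_of_day..date.end_of_day).count
  end
end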

Upvotes: 0

Denis de Bernardy

Reputation: 78463

It's just a COUNT with a where clause on the date "created_at".

In that case the only credible index you can use is the one on created_at...

If write performance is an issue (methinks it's unlikely...) and you're using a composite primary key, clustering the table using that index might help too.
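
In PostgreSQL that's the CLUSTER command, e.g. run from a migration (a sketch; "votes_pkey" is an assumed index name, and note that CLUSTER takes an exclusive lock while it rewrites the table):

class ClusterVotes < ActiveRecord::Migration
  def up
    # Physically reorders the table rows to match the primary key index.
    execute "CLUSTER votes USING votes_pkey"
  end
end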

Upvotes: 0
