Reputation: 14578
I have been reading about COUNT(*) without a WHERE clause being incredibly slow in PostgreSQL, and coming from MySQL I don't know if I can live without it. I have also read that even if you add a WHERE clause, Postgres has to scan each row of the result, which will be slow unless you have very few results. I have seen there are crazy hacks using triggers and extra tables, but I don't like how that looks. I mean, I'm new to this RDBMS, just starting to use the basic functionality, and I already have to use workarounds?
An example of what I need to do is to create a flood protection mechanism: something like, if count(id) where ip = 1.2.3.4 is greater than 100, fail instead of inserting.
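Roughly, with a hypothetical requests(id, ip) table, the naive version I have in mind would be:

select count(id) from requests where ip = '1.2.3.4';
-- in the application: if the count is greater than 100, fail;
-- otherwise go ahead with the insert
insert into requests (ip) values ('1.2.3.4');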
So my questions are:
Upvotes: 1
Views: 680
Reputation: 324465
This is mostly sorted out in PostgreSQL 9.2 with index-only scans. You should be fine; just make sure autovacuum runs often enough to keep the visibility map up to date.
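To illustrate with a hypothetical requests(ip) table (the table and index names are made up):

create index requests_ip_idx on requests (ip);
vacuum requests; -- keeps the visibility map current, which index-only scans rely on
explain select count(*) from requests where ip = '1.2.3.4';
-- on 9.2+ with an up-to-date visibility map you should see something like
-- "Index Only Scan using requests_ip_idx on requests"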
Even in prior versions I would say "incredibly slow" is perhaps over the top unless you've got huge tables or very slow sequential I/O. It did require a seqscan so it wasn't cheap, but things like synchronized scans help a lot.
COUNT(id) is better style, so you should generally prefer it anyway. I haven't dug into the performance implications in detail; I think Pg will just use the primary key either way, but I would need to do more digging than I have time for to verify that.
Your proposed use is prone to race conditions that will make it ineffective in any database. That might not matter if you don't mind inserting (say) 120 records instead of the planned 100, but if you need it to be exact you must lock the table first; otherwise many concurrent inserts can all check the count, all see that it is OK, and all insert a row.
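For illustration, a minimal sketch of the locking approach, again using a hypothetical requests(ip) table:

begin;
-- share row exclusive mode conflicts with itself and with the row exclusive
-- lock taken by insert, so two sessions can't run this check-then-insert
-- sequence at the same time
lock table requests in share row exclusive mode;
insert into requests (ip)
select '1.2.3.4'
where (select count(*) from requests where ip = '1.2.3.4') <= 100;
commit;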
You'll find that in general Pg's counts aren't that different to those of the MVCC, transaction-safe InnoDB storage engine in MySQL.
If you're using MyISAM, well, you get speed in exchange for more problems than you can shake a stick at; I prefer a slower count to the inability to roll back, the data munging needed to avoid errors as a workaround for that inability to roll back, the lack of crash safety, etc.
Upvotes: 2
Reputation: 125244
In this case you can limit the count:

select count(*) > 100
from (
    select 1
    from t
    where ip = '1.2.3.4'
    limit 101
) s

The inner query stops scanning at 101 rows regardless of the number of rows matching the criteria, say, 100,000. If more than 100 rows match, the query returns true, else false.
You can combine that test with the insert command so you don't need a round trip to the server:
with c as (
    select count(*) <= 100 as under_limit
    from (
        select 1
        from t
        where ip = '1.2.3.4'
        limit 101
    ) s
)
insert into t (x, y)
select 1, 2
where (select under_limit from c)
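Note that the test is inverted here: the insert proceeds only while the count is at or below 100. When the limit has been reached the statement inserts zero rows (the command tag reads INSERT 0 0), which the application can treat as the "fail" case. As the other answer points out, without locking, concurrent sessions can still race a few rows past the limit.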
Upvotes: 1
Reputation: 12417
I think "incredibly slow" is a huge exaggeration :) In your example you have maximum 100 rows in the result, counting them will be very-very fast. I'm pretty sure you don't need any hacks or workarounds in this case.
As I understand, "COUNT(*)
without WHERE
clause is slower in Postgres" refers to the fact that MySQL's MyISAM tables store the total number of rows in the table header, so doing SELECT COUNT(*) FROM my_table
is an O(1) operation - just read the value from the header and that's it. In Postgres, because of MVCC and transactions this can not be done and it has to scan each row to determine if it is visible in the current transaction.
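You can see that scan in the plan. A quick illustration with a hypothetical my_table (on 9.2+ an index-only scan may replace this; exact costs will vary):

explain select count(*) from my_table;
-- Aggregate
--   ->  Seq Scan on my_table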
However, if you use a WHERE clause in your query, MySQL can no longer read the row count from the header and actually has to count the rows. I don't think there is a huge difference in performance compared to Postgres in that case.
Upvotes: 3