Reputation: 8903

Performance of queries using count(*) on tables with many rows (300 million+)

I understand there are limitations to using sqlite, but I'd like to know if it should be able to handle this scenario.

My table has over 300 million records and the db is about 12 gigs. The data import util with sqlite is nice and fast. But then I added an index to a string column in this table, and it ran all night to complete this operation. I haven't compared this to other db's, but seemed quite slow to me.

Now that my index is added, I want to look for duplicates in the data. So I'm trying to run a "having count > 1" query and it seems to be taking hours as well. My query looks like:

select col1, count(*) 
from table1
group by col1
having count(*) > 1

I would assume this query would use my index on col1, but the slow query execution makes me wonder if it is not?
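One way to check whether the index is being used is to ask SQLite for the query plan. The following is a minimal sketch using Python's standard sqlite3 module on a tiny in-memory table; the index name idx_table1_col1 and the sample rows are made up for illustration, and the exact plan wording differs between SQLite versions.

```python
import sqlite3

# Tiny in-memory stand-in for the real 300-million-row table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 TEXT)")
conn.executemany(
    "INSERT INTO table1 (col1) VALUES (?)",
    [("a",), ("b",), ("a",), ("c",), ("b",), ("a",)],
)
# Hypothetical index name; use whatever your index is called.
conn.execute("CREATE INDEX idx_table1_col1 ON table1 (col1)")

DUPLICATES_SQL = """
SELECT col1, COUNT(*)
FROM table1
GROUP BY col1
HAVING COUNT(*) > 1
"""

# Values that occur more than once, with their counts.
duplicates = conn.execute(DUPLICATES_SQL).fetchall()

# The human-readable plan text is the last column of each plan row.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + DUPLICATES_SQL)]
uses_index = any("idx_table1_col1" in detail for detail in plan)

print(duplicates)
print(plan)
print("index mentioned in plan:", uses_index)
```

If the index name does not appear in the plan on the real database, the GROUP BY is probably being done with a sort or a temporary b-tree instead.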

Would perhaps sql server handle this kind of thing better?

Upvotes: 0

Answers (3)

goTo-devNull

Reputation: 9372

SQLite's count() isn't optimized: in my test it does a full table scan even if the column is indexed. Run EXPLAIN QUERY PLAN to verify:

EXPLAIN QUERY PLAN SELECT COUNT(FIELD_NAME) FROM TABLE_NAME;

I get something like this:

0|0|0|SCAN TABLE TABLE_NAME (~1000000 rows)
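To repeat this check on your own SQLite build, something like the sketch below prints the plan for the same kind of count before and after an index exists. It uses Python's standard sqlite3 module; the table, the index name and the sample data are placeholders. The output depends on the SQLite version, and newer releases may report an index scan rather than a table scan.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the detail text of each EXPLAIN QUERY PLAN row for sql."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 TEXT)")
conn.executemany(
    "INSERT INTO table1 (col1) VALUES (?)",
    [("value%d" % (i % 100),) for i in range(1000)],
)

COUNT_SQL = "SELECT COUNT(col1) FROM table1"

# Plan without any index on col1.
plan_before = query_plan(conn, COUNT_SQL)

# Hypothetical index name, for illustration only.
conn.execute("CREATE INDEX idx_table1_col1 ON table1 (col1)")

# Plan once the index exists; compare the two lines.
plan_after = query_plan(conn, COUNT_SQL)

print("SQLite version:", sqlite3.sqlite_version)
print("before index:", plan_before)
print("after index: ", plan_after)
```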

Upvotes: 3

Dan D.

Reputation: 74655

increase the sqlite cache via PRAGMA cache_size=<number of pages>. the memory used is <number of pages> times <size of page> (the page size can be set via PRAGMA page_size=<size of page>).

by setting those values to 16000 and 32768 respectively (or about 512MB), i was able to get this one program's bulk load down from 20mins to 2mins. (although i think that if the disk on that system wasn't so slow, this might not have had as much effect)

but you might not have this extra memory available on lesser embedded platforms, so i don't recommend increasing it as much as i did on those. for desktop or laptop level systems it can greatly help.
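as a rough sketch of the settings above, using Python's standard sqlite3 module: the file name example.db and the table are placeholders. it assumes a brand new database file, because page_size only takes effect before the database is first written (or after a VACUUM), while cache_size is a per-connection setting that has to be issued again on every new connection.

```python
import os
import sqlite3
import tempfile

PAGE_SIZE = 32768    # bytes per page
CACHE_PAGES = 16000  # number of pages kept in the cache

# Placeholder path: a fresh database file in a temporary directory.
db_path = os.path.join(tempfile.mkdtemp(), "example.db")
conn = sqlite3.connect(db_path)

# Must run before anything is written to the new database.
conn.execute("PRAGMA page_size = %d" % PAGE_SIZE)
conn.execute("CREATE TABLE table1 (col1 TEXT)")
conn.commit()

# Per-connection setting; repeat it each time you open the database.
conn.execute("PRAGMA cache_size = %d" % CACHE_PAGES)

# Read the values back to confirm what SQLite is actually using.
page_size = conn.execute("PRAGMA page_size").fetchone()[0]
cache_pages = conn.execute("PRAGMA cache_size").fetchone()[0]
cache_mib = page_size * cache_pages / (1024 * 1024)

print(page_size, cache_pages, cache_mib)
```

reading the pragmas back is worth doing, since a page_size request on an existing database is silently ignored until the next VACUUM.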

Upvotes: 0

TomTom

Reputation: 62101

But then I added an index to a string column in this table, and it ran all night to complete this operation. I haven't compared this to other db's, but seemed quite slow to me.

I hate to tell you, but what does your server look like? Not arguing, but that is possibly a very resource-intensive operation that may require a lot of IO, and normal computers or cheap web servers with a slow hard disc are not suited for significant database work. I run hundreds-of-gigabytes db project work, and my smallest "large data" server has 2 SSDs and 8 Velociraptors for data and log. The largest one has 3 storage nodes with a total of 1000 GB of SSD discs, simply because IO is what a db server lives and breathes on.

So I'm trying to run a "having count > 0" query and it seems to be taking hours as well

How much RAM? Enough to fit it all in memory, or a low-memory virtual server where the missing memory turns into bad IO? How much memory can / does SQLite use? How is the temp storage set up? In memory? SQL Server would possibly use a lot of memory / tempdb space for this type of check.
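On the temp question: in SQLite, where temporary tables and indices live is controlled by PRAGMA temp_store. A minimal sketch with Python's standard sqlite3 module, assuming a build that lets the pragma override the compile-time default:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Ask SQLite to keep temporary tables and indices in RAM.
conn.execute("PRAGMA temp_store = MEMORY")

# Read it back: 0 = default, 1 = file, 2 = memory.
temp_store = conn.execute("PRAGMA temp_store").fetchone()[0]
print("temp_store:", temp_store)
```

Whether this helps depends on having enough RAM for the temporary data; with too little memory it can make things worse.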

Upvotes: 1
