Oleg Golovanov

Reputation: 924

Billions of rows in PostgreSQL: to partition or not to partition?

What I have:

Simplified schema of the table, which will be very big:

id          | integer | not null default nextval('table_id_seq'::regclass)
url_id      | integer | not null
domain_id   | integer | not null
position    | integer | not null

The problem with the schema above is that I don't have a definite answer on how to partition it. Data from all periods is going to be used (NO queries will have date filters).

I thought about partitioning on the "domain_id" field, but the problem is that it is hard to predict how many rows each partition would have.
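For illustration, here is a minimal sketch of what that could look like with declarative hash partitioning (PostgreSQL 11+); the table name "positions" and the partition count are placeholders, and hash partitioning spreads rows roughly evenly across partitions even when per-domain row counts are hard to predict:

-- Hypothetical sketch: hash partitioning on domain_id (PostgreSQL 11+).
-- "positions" and the modulus of 8 are placeholders, not from the question.
CREATE TABLE positions (
    id        bigserial NOT NULL,   -- bigserial: billions of rows overflow a plain integer
    url_id    integer   NOT NULL,
    domain_id integer   NOT NULL,
    position  integer   NOT NULL
) PARTITION BY HASH (domain_id);

-- Hash partitioning distributes rows roughly evenly regardless of how
-- many rows each individual domain ends up with.
CREATE TABLE positions_p0 PARTITION OF positions
    FOR VALUES WITH (MODULUS 8, REMAINDER 0);
CREATE TABLE positions_p1 PARTITION OF positions
    FOR VALUES WITH (MODULUS 8, REMAINDER 1);
-- ... and so on for remainders 2 through 7.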

My main question is:

Does it make sense to partition the data if I won't use partition pruning and I am not going to delete old data?

What would the pros and cons of that be?

How will my import speed degrade if I don't partition?

Another question related to normalization:

Should the url be extracted into a separate table?

  • Pros of normalization
  • Pros of denormalization

Can anybody give me any advice? Thanks!

Upvotes: 6

Views: 7977

Answers (3)

kgrittn

Reputation: 19471

Partitioning is most useful if you are going to either have selection criteria in most queries which allow the planner to skip access to most of the partitions most of the time, or if you want to periodically purge all rows that are assigned to a partition, or both. (Dropping a table is a very fast way to delete a large number of rows!) I have heard of people hitting a threshold where partitioning helped keep indexes shallower, and therefore boost performance; but really that gets back to the first point, because you effectively move the first level of the index tree to another place -- it still has to happen.

On the face of it, it doesn't sound like partitioning will help.

Normalization, on the other hand, may improve performance more than you expect; by keeping all those rows narrower, you can get more of them into each page, reducing overall disk access. I would do proper 3rd normal form normalization, and only deviate from that based on evidence that it would help. If you see a performance problem while you still have disk space for a second copy of the data, try creating a denormalized table and see how its performance compares to the normalized version.
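To illustrate the narrower-rows point, a possible normalized layout (table and column names are hypothetical) would keep the url text in its own table and leave the big table with only small fixed-width columns:

-- Hypothetical normalized layout: url text stored once, referenced by id.
CREATE TABLE urls (
    id  serial PRIMARY KEY,
    url text NOT NULL UNIQUE
);

-- The big table stays narrow (a handful of fixed-width columns), so more
-- rows fit on each 8 kB page and fewer pages have to be read from disk.
CREATE TABLE positions (
    id        bigserial NOT NULL,
    url_id    integer   NOT NULL REFERENCES urls (id),
    domain_id integer   NOT NULL,
    position  integer   NOT NULL
);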

Upvotes: 12

vyegorov

Reputation: 22855

With the given amount of data in mind, you'll mostly be waiting on IO. If possible, perform some tests with different HW configurations, trying to get the best IO figures for your scenarios. IMHO, 2 disks will not be enough after a while, unless there's something else behind the scenes.

Your table will be growing daily at a known rate. And most likely it will be queried daily. As you haven't mentioned data being purged (if it will be, then do partition it), this means that queries will run slower each day. At some point you'll start looking at how to optimize your queries. One of the possibilities is to parallelize queries at the application level. But for that, some conditions should be met:

  • your table should be partitioned in order to parallelize queries;
  • HW should be capable of delivering the requested amount of IO in N parallel streams.

All of these answers should come from performance tests of the different setups.
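As a rough illustration of application-level parallelization over a partitioned table, each worker could open its own connection and scan one partition directly, with the application combining the per-partition results; the partition names below are hypothetical:

-- Hypothetical sketch: one query per partition, issued on separate
-- connections by separate application workers (positions_p0, positions_p1,
-- ... are placeholder partition names).

-- worker 1
SELECT domain_id, count(*) AS cnt FROM positions_p0 GROUP BY domain_id;

-- worker 2
SELECT domain_id, count(*) AS cnt FROM positions_p1 GROUP BY domain_id;

-- ... one such query per partition; the application then sums the
-- per-partition counts for each domain_id to get the overall result.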

And as others mentioned, there are more benefits for the DBA in partitioned tables, so I personally would go for partitioning any table that is expected to receive more than 5M rows per interval, be it a day, a week or a month.

Upvotes: 0

duffymo

Reputation: 308743

I think it makes sense, depending on your use cases. I don't know how far back in time your 30B row history goes, but it makes sense to partition if your transactional database doesn't need more than a few of the partitions you decide on.

For example, partitioning by month makes perfect sense if you only query for two months' worth of data at a time. The other ten months of the year can be moved into a reporting warehouse, keeping the transactional store smaller.
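A sketch of that layout is below; note that it assumes a hypothetical created_at timestamp column, which the simplified schema in the question does not actually have, and the table/partition names are placeholders:

-- Hypothetical sketch: monthly range partitioning (PostgreSQL 10+);
-- assumes a created_at column not present in the question's schema.
CREATE TABLE positions (
    id         bigserial   NOT NULL,
    url_id     integer     NOT NULL,
    domain_id  integer     NOT NULL,
    position   integer     NOT NULL,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE positions_2012_01 PARTITION OF positions
    FOR VALUES FROM ('2012-01-01') TO ('2012-02-01');

-- Months that are no longer queried can later be detached and moved to a
-- reporting warehouse, keeping the transactional store small:
ALTER TABLE positions DETACH PARTITION positions_2012_01;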

There are restrictions on the fields you can use in the partition key. You'll have to be careful with those.

Get a performance baseline, do your partition, and remeasure to check for performance impacts.
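One simple way to take such a baseline is to time a representative query with EXPLAIN before the change and again after partitioning; the query below is only a placeholder:

-- Possible baseline measurement: run a representative query with timing and
-- buffer statistics, save the plan and timings, then repeat after partitioning.
EXPLAIN (ANALYZE, BUFFERS)
SELECT domain_id, count(*)
FROM positions               -- placeholder name for the big table
WHERE domain_id = 42
GROUP BY domain_id;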

Upvotes: 1
