Nika
Nika

Reputation: 63

Does SERIAL8 (BIGSERIAL) in Greenplum cause performance issues with parallel inserts?

I am using Greenplum Database (GPDB) for OLAP workloads, and I have several tables where the primary key is defined as SERIAL8. Since SERIAL8 automatically creates a unique SEQUENCE for each table, I assumed this would work efficiently for parallel inserts. However, I’ve read that in Greenplum, SEQUENCE is managed globally by the master node, which could create performance bottlenecks when multiple segments request new IDs in parallel. Given that Greenplum is an MPP system, I’m concerned that heavy parallel inserts (INSERT INTO ... SELECT ...) could slow down due to SEQUENCE contention.

My questions are:

  1. Does using SERIAL8 (or BIGSERIAL) in Greenplum actually cause performance issues during large-scale parallel inserts?

  2. If yes, what is the best way to optimize it? Should I manually create a SEQUENCE with INCREMENT to distribute IDs across segments, or is there another recommended approach?

  3. Would setting DISTRIBUTED BY (id) help in any way, or does it only affect data distribution and not SEQUENCE contention?

  4. Are there alternative strategies (e.g., UUIDs, pre-generated IDs in ETL, etc.) that are better suited for high-performance OLAP inserts?

I want to ensure that my ID generation strategy does not become a bottleneck as data volume grows. Any insights from experienced Greenplum users would be greatly appreciated!

Upvotes: 1

Views: 38

Answers (1)

Zegarek
Zegarek

Reputation: 26347

If it turns out to be a problem, you can increase the sequence cache setting:[postgres][greenplum]

alter sequence s1 cache 10;

That way each concurrent worker needs to touch the sequence object only once every 10 nextval() requests because it'll get pre-allocated 10 values at once.

The price is that it increases the amount of gaps in the sequence - all pre-allocated and unused values get discarded, they don't go back to be re-used by some other session. Thing is, even with cache 1 you should not rely on a serial column being gapless, nor should you rely on the insertion order following the order of numbers returned by the sequence. That sequence loses values any time something takes a nextval() and rolls back, or upserts with insert..on conflict.

You can also side-step the problem entirely:

create table t1(
  id uuid primary key default gen_random_uuid()
 ,created_at timestamptz default now()
);

By design, gen_random_uuid() makes sure the column is unique and it doesn't matter where the identifier is generated, removing the need for clients to share the sequence object. If you needed some sort of insertion order info, an actual timestamptz is more reliable than a sequence.

If gen_random_uuid() isn't available in your version of Greenplum, you can use the uuid-ossp package.

Upvotes: 1

Related Questions