Reputation: 394
Greenplum says that it supports parallel data loading, and I'm trying to understand how it works. I understand that records are read in parallel, but I can't see how the parallel writes are done. Are the parallel writes done on the same database, or on different databases (segments)? Please do explain. Thanks
Upvotes: 0
Views: 686
Reputation: 227
John is correct.
Each instance of gpfdist, by default, will handle 4 concurrent connections. When loading, each segment with a connection will read its "chunk" of data and process it according to the distribution hash of the table.
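A minimal sketch of what that looks like in practice, assuming gpfdist is already running on two ETL hosts (the host names, port, file pattern, and table columns here are all made-up examples, not anything from the question):

```sql
-- Hypothetical readable external table backed by two gpfdist instances.
CREATE EXTERNAL TABLE ext_sales (
    id  int,
    amt numeric
)
LOCATION ('gpfdist://etl-host1:8081/sales*.csv',
          'gpfdist://etl-host2:8081/sales*.csv')
FORMAT 'CSV' (HEADER);

-- Segments pull rows from gpfdist in parallel; each row then lands on
-- the segment that owns its distribution-key hash.
INSERT INTO sales SELECT * FROM ext_sales;
```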
See: https://blog.2ndquadrant.com/parallel_etl_with_greenplum/
Upvotes: 0
Reputation: 11
Concurrent reads/writes can be done at segment level with the help of gpfdist or gphdfs.
For example, if you want to unload data to files on disk, you can use a writable external table that connects to several gpfdist locations, and each segment will write its data to those destinations in parallel.
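As a rough sketch of that unload path (host names, port, file names, table name, and the distribution column are assumptions for illustration):

```sql
-- Hypothetical writable external table spread over two gpfdist endpoints.
CREATE WRITABLE EXTERNAL TABLE ext_sales_out (LIKE sales)
LOCATION ('gpfdist://etl-host1:8081/sales_out1.csv',
          'gpfdist://etl-host2:8081/sales_out2.csv')
FORMAT 'CSV'
DISTRIBUTED BY (id);

-- Each segment streams its own rows out to a gpfdist endpoint,
-- so the unload runs in parallel across the cluster.
INSERT INTO ext_sales_out SELECT * FROM sales;
```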
Upvotes: 0
Reputation: 11531
The parallel writes are done on different segments, with data being fed by 1 or more instances of gpfdist running on the ETL server(s). I suspect a significant part of the magic is the DISTRIBUTED BY clause, which is used to scatter the rows of a table across the segment servers.
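For anyone unfamiliar with the clause, it's set at table creation time; a minimal sketch (table and column names are just examples):

```sql
-- Rows are hashed on the DISTRIBUTED BY column(s), and the hash decides
-- which segment stores each row, so writes naturally spread out.
CREATE TABLE sales (
    id  int,
    amt numeric
) DISTRIBUTED BY (id);
```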
Upvotes: 1