Reputation: 394
Greenplum says that it supports parallel data loading, and I'm trying to understand how it works. I understand that records are read in parallel, but I can't see how the parallel writes are done. Are the parallel writes done on the same database, or on different databases (segments)? Please do explain. Thanks
Upvotes: 0
Views: 686
Reputation: 227
John is correct.
Each instance of gpfdist, by default, will handle 4 concurrent connections. When loading, each segment with a connection will read its "chunk" of data and process it according to the distribution hash of the table.
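A minimal sketch of what that looks like in practice, assuming gpfdist is already running on two ETL hosts (the host names, port, file pattern, and table columns here are all made-up examples, not anything from the question):

```sql
-- Hypothetical readable external table backed by two gpfdist instances.
CREATE EXTERNAL TABLE ext_sales (
    id  int,
    amt numeric
)
LOCATION ('gpfdist://etl-host1:8081/sales*.csv',
          'gpfdist://etl-host2:8081/sales*.csv')
FORMAT 'CSV' (HEADER);

-- Segments pull rows from gpfdist in parallel; each row then lands on
-- the segment that owns its distribution-key hash.
INSERT INTO sales SELECT * FROM ext_sales;
```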
See: https://blog.2ndquadrant.com/parallel_etl_with_greenplum/
Upvotes: 0
Reputation: 11
Concurrent reads/writes can be done at segment level with the help of gpfdist or gphdfs.
For example, if you want to unload data to files on disk, you can use a writable external table that connects to several gpfdist locations, and each segment will write its data to those destinations in parallel.
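As a rough sketch of that unload path (host names, port, file names, table name, and the distribution column are assumptions for illustration):

```sql
-- Hypothetical writable external table spread over two gpfdist endpoints.
CREATE WRITABLE EXTERNAL TABLE ext_sales_out (LIKE sales)
LOCATION ('gpfdist://etl-host1:8081/sales_out1.csv',
          'gpfdist://etl-host2:8081/sales_out2.csv')
FORMAT 'CSV'
DISTRIBUTED BY (id);

-- Each segment streams its own rows out to a gpfdist endpoint,
-- so the unload runs in parallel across the cluster.
INSERT INTO ext_sales_out SELECT * FROM sales;
```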
Upvotes: 0
Reputation: 11531
The parallel writes are done on different segments, with data being fed by 1 or more instances of gpfdist running on the ETL server(s). I suspect a significant part of the magic is the DISTRIBUTED BY clause, which is used to scatter the rows of a table across the segment servers.
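For anyone unfamiliar with the clause, it's set at table creation time; a minimal sketch (table and column names are just examples):

```sql
-- Rows are hashed on the DISTRIBUTED BY column(s), and the hash decides
-- which segment stores each row, so writes naturally spread out.
CREATE TABLE sales (
    id  int,
    amt numeric
) DISTRIBUTED BY (id);
```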
Upvotes: 1