Reputation: 289
Our product is extracts from our database; the output files can be 300GB+ in size. To produce them we join multiple large tables (some close to 1TB). We do not aggregate data at all, it's pure extracts. How does Greenplum handle this kind of large data set? The join keys are composite keys of 3+ columns, and not every table has the same keys to join on; the only common key is the first one, and if the data were distributed by that key alone there would be a lot of skew, since the data itself is not balanced.
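For example (simplified, made-up table and column names), the choice we're weighing looks roughly like this:

    -- Distributing only on the shared first key piles rows onto a few segments,
    -- because the data is heavily unbalanced on that key:
    CREATE TABLE big_table_a (key1 int, key2 int, key3 int, payload text)
    DISTRIBUTED BY (key1);

    -- Distributing on the full composite key spreads rows more evenly, but only
    -- the tables that actually have all three columns can use it:
    CREATE TABLE big_table_b (key1 int, key2 int, key3 int, payload text)
    DISTRIBUTED BY (key1, key2, key3);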
Upvotes: 0
Views: 579
Reputation: 855
In general, Greenplum Database handles this kind of load just fine. The query is executed in parallel on the segments.
Your bottleneck is likely the final export from the database: if you unload with plain SQL (or COPY), everything has to pass through the master on its way to the client, which is slow at this volume.
As Jon pointed out, consider using a writable external table and writing the data out as it comes out of the query. Also avoid any kind of sort operation in the query if possible; sorting is wasted effort because the rows arrive unsorted in the external table files anyway.
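For illustration, a minimal sketch of that approach (host names, ports, table and column names here are all made up):

    -- A writable external table lets every segment stream its slice of the result
    -- out through gpfdist, instead of funneling all rows through the master.
    CREATE WRITABLE EXTERNAL TABLE extract_out
        (key1 int, key2 int, key3 int, payload text)
    LOCATION ('gpfdist://etl-host1:8081/extract_1.csv',
              'gpfdist://etl-host2:8081/extract_2.csv')
    FORMAT 'CSV' (DELIMITER ',')
    DISTRIBUTED RANDOMLY;

    -- No ORDER BY: the rows end up unsorted in the files regardless.
    INSERT INTO extract_out
    SELECT a.key1, a.key2, a.key3, b.payload
    FROM   big_table_a a
    JOIN   big_table_b b
      ON   a.key1 = b.key1 AND a.key2 = b.key2 AND a.key3 = b.key3;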
Upvotes: 0
Reputation: 2106
You should use writable external tables for these large data extracts because they leverage gpfdist and write data out from the segments in parallel. It will be very fast.
https://gpdb.docs.pivotal.io/510/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html
Also, your use case doesn't really indicate skew. Skew is either storage skew, from distributing the data on a poor column choice like gender_code, or processing skew, where you filter on a column or columns such that only a few segments have the matching data.
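If you want to verify, a quick check (substitute your real table name) is to count rows per segment; roughly even counts mean the distribution key is not causing storage skew:

    SELECT gp_segment_id, count(*) AS rows_on_segment
    FROM   big_table_a
    GROUP  BY gp_segment_id
    ORDER  BY rows_on_segment DESC;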
Upvotes: 0