Reputation: 1296
Greenplum Database version: PostgreSQL 8.2.15 (Greenplum Database 4.2.3.0 build 1)
SQL Server Database version: Microsoft SQL Server 2008 R2 (SP1)
Our current approach:
1) Export each table to a flat file from SQL Server
2) Load the data into Greenplum with pgAdmin III using PSQL Console's psql.exe utility
Benifits...
Speed: OK, but is there anything faster? We load millions of rows of data in minutes
Automation: OK, we call this utility from an SSIS package using a Shell script in VB
Pitfalls...
Reliability: ETL is dependent on the file server to hold the flat files
Security: Lots of potentially sensitive data on the file server
Error handling: It's a problem. psql.exe never raises an error that we can catch even if it does error out and loads no data or a partial file
What else we have tried...
.Net Providers\Odbc Data Provider: We have configured a System DSN using DataDirect 6.0 Greenplum Wire Protocol. Good performance for a DELETE. Dog awful slow for an INSERT.
For reference, this is the aforementioned VB script in SSIS...
Public Sub Main()
Dim v_shell
Dim v_psql As String
v_psql = "C:\Program Files\pgAdmin III\1.10\psql.exe -d "MyGPDatabase" -h "MyGPHost" -p "5432" -U "MyServiceAccount" -f \\MyFileLocation\SSIS_load\sql_files\load_MyTable.sql"
v_shell = Shell(v_psql, AppWinStyle.NormalFocus, True)
End Sub
This is the contents of the "load_MyTable.sql" file...
\copy MyTable from '\\MyFileLocation\SSIS_load\txt_files\MyTable.txt' with delimiter as ';' csv header quote as '"'
Upvotes: 0
Views: 2175
Reputation: 5236
If you're getting your data load done in minutes, then the current method is probably good enough. However, if you find yourself having to load larger volumes of data (terabyte scale for instance), the usual preferred method for bulk-loading into Greenplum is via gpfdist
and corresponding EXTERNAL TABLE
definitions. gpload
is a decent wrapper that provides abstraction over much of this process and is driven by YAML control files. The general idea is that gpfdist
instance(s) are spun up at the location(s) where your data is staged, preferrably as CSV text files, and then the EXTERNAL TABLE
definition within Greenplum is made aware of the URIs for the gpfdist
instances. From the admin guide, a sample definition of such an external table could look like this:
CREATE READABLE EXTERNAL TABLE students (
name varchar(20), address varchar(30), age int)
LOCATION ('gpfdist://<host>:<portNum>/file/path/')
FORMAT 'CUSTOM' (formatter=fixedwidth_in,
name=20, address=30, age=4,
preserve_blanks='on',null='NULL');
The above example expects to read text files whose fields from left to right are a 20-character (at most) string, a 30-character string, and an integer. To actually load this data into a staging table inside GP:
CREATE TABLE staging_table AS SELECT * FROM students;
For large volumes of data, this should be the most efficient method since all segment hosts are engaged in the parallel load. Do keep in mind that the simplistic approach above will probably result in a randomly distributed table, which may not be desirable. You'd have to customize your table definitions to specify a distribution key.
Upvotes: 1