Jon Jaussi

Reputation: 1296

Most effective way to push data from a SQL Server database into a Greenplum database?

Greenplum Database version: PostgreSQL 8.2.15 (Greenplum Database 4.2.3.0 build 1)

SQL Server Database version: Microsoft SQL Server 2008 R2 (SP1)

Our current approach:

1) Export each table to a flat file from SQL Server (one way to script this step is sketched below)

2) Load the data into Greenplum with pgAdmin III using PSQL Console's psql.exe utility
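For illustration only, step 1 could be scripted with SQL Server's bcp utility; the server and database names below (MySqlServerHost, MyDatabase) are placeholders, and bcp will not write the header row that the \copy command further down expects:

bcp MyDatabase.dbo.MyTable out "\\MyFileLocation\SSIS_load\txt_files\MyTable.txt" -c -t ";" -S MySqlServerHost -T

(-c exports character data, -t sets the field terminator to match the semicolon delimiter used on the Greenplum side, and -T uses Windows authentication.)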


Benefits...


Pitfalls...


What else we have tried...

.Net Providers\Odbc Data Provider: We configured a System DSN using the DataDirect 6.0 Greenplum Wire Protocol driver. Performance is good for a DELETE but painfully slow for an INSERT.

For reference, this is the aforementioned VB script in SSIS...

Public Sub Main()

    ' Process ID returned by Shell; v_psql holds the full command line
    Dim v_shell As Integer
    Dim v_psql As String

    ' Embedded quotes inside a VB.NET string literal must be doubled ("");
    ' the psql.exe path is quoted because it contains spaces
    v_psql = """C:\Program Files\pgAdmin III\1.10\psql.exe"" -d ""MyGPDatabase"" -h ""MyGPHost"" -p ""5432"" -U ""MyServiceAccount"" -f \\MyFileLocation\SSIS_load\sql_files\load_MyTable.sql"

    ' Wait:=True blocks the SSIS task until psql finishes
    v_shell = Shell(v_psql, AppWinStyle.NormalFocus, True)

End Sub


This is the contents of the "load_MyTable.sql" file...

\copy MyTable from '\\MyFileLocation\SSIS_load\txt_files\MyTable.txt' with delimiter as ';' csv header quote as '"'

Upvotes: 0

Views: 2175

Answers (1)

rchang

Reputation: 5236

If you're getting your data load done in minutes, then the current method is probably good enough. However, if you find yourself having to load larger volumes of data (terabyte scale, for instance), the usual preferred method for bulk-loading into Greenplum is gpfdist with corresponding EXTERNAL TABLE definitions. gpload is a decent wrapper that abstracts much of this process and is driven by YAML control files. The general idea is that gpfdist instance(s) are spun up wherever your data is staged, preferably as CSV text files, and the EXTERNAL TABLE definition within Greenplum is pointed at the URIs of those gpfdist instances. From the admin guide, a sample definition of such an external table could look like this:

CREATE READABLE EXTERNAL TABLE students (
    name varchar(20), address varchar(30), age int)
LOCATION ('gpfdist://<host>:<portNum>/file/path/')
FORMAT 'CUSTOM' (formatter=fixedwidth_in,
    name=20, address=30, age=4,
    preserve_blanks='on', null='NULL');

The above example reads fixed-width text files whose fields, from left to right, are a 20-character string, a 30-character string, and a 4-character integer field. To actually load this data into a staging table inside GP:

CREATE TABLE staging_table AS SELECT * FROM students;

For large volumes of data, this should be the most efficient method since all segment hosts are engaged in the parallel load. Do keep in mind that the simplistic approach above will probably result in a randomly distributed table, which may not be desirable. You'd have to customize your table definitions to specify a distribution key.
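If you want control over the distribution, a minimal sketch of the same staging load with an explicit key (using name here purely for illustration) would be:

-- same CTAS load as above, but with an explicit distribution key
CREATE TABLE staging_table AS
SELECT * FROM students
DISTRIBUTED BY (name);

And since gpload was mentioned above, a rough sketch of a YAML control file for semicolon-delimited CSV files with a header row might look something like the following (the host, port, file path and error table are placeholders, so check the gpload reference for your Greenplum version before relying on it):

---
VERSION: 1.0.0.1
DATABASE: MyGPDatabase
USER: MyServiceAccount
HOST: MyGPHost
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - my-etl-host          # host where gpfdist serves the staged files
         PORT: 8081
         FILE:
           - /data/SSIS_load/txt_files/MyTable.txt
    - FORMAT: csv
    - DELIMITER: ';'
    - QUOTE: '"'
    - HEADER: true
    - ERROR_LIMIT: 25
    - ERROR_TABLE: public.err_mytable
   OUTPUT:
    - TABLE: public.mytable
    - MODE: insert

gpload stands up the gpfdist instance, creates the external table, and runs the insert for you, which is usually easier to operationalize than wiring up each step by hand.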

Upvotes: 1
