Werner

Reputation: 95

fastparquet export for Redshift

I had a very simple idea: Use Python Pandas (for convenience) to do some simple database operations with moderate data amounts and write the data back to S3 in Parquet format. Then, the data should be exposed to Redshift as an external table in order to not take storage space from the actual Redshift cluster.

I found two ways to do that.


Given the data:

from datetime import date, datetime
import pandas as pd

data = {
    'int': [1, 2, 3, 4, None],
    'float': [1.1, None, 3.4, 4.0, 5.5],
    'str': [None, 'two', 'three', 'four', 'five'],
    'boolean': [True, None, True, False, False],
    'date': [
        date(2000, 1, 1),
        date(2000, 1, 2),
        date(2000, 1, 3),
        date(2000, 1, 4),
        None,
    ],
    'timestamp': [
        datetime(2000, 1, 1, 1, 1, 1),
        datetime(2000, 1, 1, 1, 1, 2),
        None,
        datetime(2000, 1, 1, 1, 1, 4),
        datetime(2000, 1, 1, 1, 1, 5),
    ]
}

df = pd.DataFrame(data)

df['int'] = df['int'].astype(pd.Int64Dtype())              # nullable integer dtype, keeps the None
df['date'] = df['date'].astype('datetime64[D]')            # turn the Python date objects into a datetime column
df['timestamp'] = df['timestamp'].astype('datetime64[s]')  # second-resolution timestamps

The type casts at the end are necessary in both cases to ensure that Pandas' automatic type inference does not interfere.
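
For a quick sanity check that the casts took effect, you can look at the dtypes (the exact names vary a bit between Pandas versions):

# 'int' should show the nullable Int64 type, 'date' and 'timestamp' should be datetime-like
print(df.dtypes)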

With PyArrow:

Using PyArrow, you do it like this:

import pyarrow as pa

pyarrow_schema = pa.schema([
    ('int', pa.int64()),
    ('float', pa.float64()),
    ('str', pa.string()),
    ('boolean', pa.bool_()),
    ('date', pa.date64()),
    ('timestamp', pa.timestamp(unit='s'))
])

df.to_parquet(
    path='pyarrow.parquet',
    schema=pyarrow_schema,
    engine='pyarrow'
)

Why use PyArrow: Pandas' default engine for Parquet export is PyArrow, so you can expect good integration. Also, PyArrow provides extensive features and supports many data types.
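
To verify what actually ended up in the file, you can read the schema back with PyArrow. A minimal sketch against the file written above:

import pyarrow.parquet as pq

# Read only the Parquet schema, without loading any data
schema = pq.read_schema('pyarrow.parquet')
print(schema)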

With fastparquet:

Here you write out the data like this:

from fastparquet import write

write('fast.parquet', df, has_nulls=True, times='int96')

The important bit here is the 'times' parameter. See this post, where I found a remedy for the 'date' column.

Why use fastparquet: Fastparquet is much more limited than PyArrow, especially when it comes to accepted data types. On the other hand, the package is much smaller.
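
A similar check is possible with fastparquet itself. A minimal sketch against the file written above:

from fastparquet import ParquetFile

# Open the file and inspect the pandas dtypes fastparquet will use when reading it back
pf = ParquetFile('fast.parquet')
print(pf.dtypes)
print(pf.to_pandas())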

The external table:

Given that you have exported your data to Parquet and stored it in S3, you can expose it to Redshift like this:

CREATE EXTERNAL TABLE "<your_external_schema>"."<your_table_name>" (
  "int" bigint,
  "float" float,
  "str" varchar(255),
  "boolean" bool,
  "date" date,
  "timestamp" timestamp
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://<your_bucket>/<your_prefix>/';
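
For completeness, getting a locally written Parquet file into that S3 location can be done with boto3; the bucket and prefix below are placeholders matching the DDL above:

import boto3

s3 = boto3.client('s3')

# Upload the Parquet file to the prefix the external table points to
s3.upload_file(
    Filename='pyarrow.parquet',
    Bucket='<your_bucket>',
    Key='<your_prefix>/pyarrow.parquet',
)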

Final story and note:

When I started working with Pandas, Parquet and external Redshift tables in the context of AWS Lambda functions, everything was fine for a while, until I reached the point where my Lambda bundle hit its allowed limit (Deployment package size). Checking which of my dependencies accounted for all that, I found PyArrow, Pandas and NumPy (a dependency of Pandas) to be the culprits. While I definitely could not drop NumPy (for efficiency) and did not want to lose Pandas (convenience, again), I looked to replace PyArrow with something more lightweight. Et voilà: fastparquet. After some research and a lot of experimentation, I could make this work as well.

I hope some other people find this explanation and these resources helpful.

Upvotes: 1

Views: 608

Answers (1)

Werner

Reputation: 95

The question already holds the answer. :)

Upvotes: 1
