I had a very simple idea: use Python Pandas (for convenience) to do some simple database operations on moderate amounts of data and write the data back to S3 in Parquet format. The data should then be exposed to Redshift as an external table, so that it does not take up storage space on the actual Redshift cluster.
I found two ways to do that.
Given the data:
from datetime import date, datetime
import pandas as pd

data = {
    'int': [1, 2, 3, 4, None],
    'float': [1.1, None, 3.4, 4.0, 5.5],
    'str': [None, 'two', 'three', 'four', 'five'],
    'boolean': [True, None, True, False, False],
    'date': [
        date(2000, 1, 1),
        date(2000, 1, 2),
        date(2000, 1, 3),
        date(2000, 1, 4),
        None,
    ],
    'timestamp': [
        datetime(2000, 1, 1, 1, 1, 1),
        datetime(2000, 1, 1, 1, 1, 2),
        None,
        datetime(2000, 1, 1, 1, 1, 4),
        datetime(2000, 1, 1, 1, 1, 5),
    ]
}

df = pd.DataFrame(data)
df['int'] = df['int'].astype(pd.Int64Dtype())
df['date'] = df['date'].astype('datetime64[D]')
df['timestamp'] = df['timestamp'].astype('datetime64[s]')
The type casts at the end are necessary in both cases to ensure that Pandas' type recognition does not interfere.
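As a quick illustration (not required for the workflow itself), you can compare Pandas' automatic inference with the explicit casts; without them, the nullable 'int' column comes out as float64 and the 'date' column as object:

# Optional check: compare inferred dtypes with the explicitly cast ones.
inferred = pd.DataFrame(data)
print(inferred.dtypes)  # e.g. 'int' is inferred as float64, 'date' as object
print(df.dtypes)        # after the casts above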
Using PyArrow, you do it like this:
import pyarrow as pa

# Explicit schema so PyArrow writes exactly these Parquet types.
pyarrow_schema = pa.schema([
    ('int', pa.int64()),
    ('float', pa.float64()),
    ('str', pa.string()),
    ('boolean', pa.bool_()),  # field name must match the DataFrame column
    ('date', pa.date64()),
    ('timestamp', pa.timestamp(unit='s'))
])

df.to_parquet(
    path='pyarrow.parquet',
    schema=pyarrow_schema,
    engine='pyarrow'
)
Why use PyArrow: Pandas' default engine for Parquet export is PyArrow, so you can expect good integration. PyArrow also provides extensive features and supports many datatypes.
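If you want to double-check the result, here is a small sketch (my addition, not part of the original workflow) that reads the file written above back with PyArrow:

import pyarrow.parquet as pq

# Inspect the schema that was actually stored in the file.
stored_schema = pq.read_schema('pyarrow.parquet')
print(stored_schema)

# Or load the whole table and compare the round-tripped dtypes.
table = pq.read_table('pyarrow.parquet')
print(table.to_pandas().dtypes)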
Using fastparquet, you first need to write out the data with these additional steps:
from fastparquet import write
write('fast.parquet', df, has_nulls=True, times='int96')
The important bit here is the 'times' parameter. See this post, where I found a remedy for the 'date' column.
Why use fastparquet: fastparquet is much more limited than PyArrow, especially when it comes to the accepted datatypes. On the other hand, the package is much smaller.
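As an optional sanity check (again my addition), you can read the fastparquet output back and compare the resulting dtypes:

from fastparquet import ParquetFile

# Load the file written above and convert it back to a DataFrame.
pf = ParquetFile('fast.parquet')
print(pf.to_pandas().dtypes)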
Given that you have exported your data to Parquet and stored it in S3, you can then expose it to Redshift like this:
CREATE EXTERNAL TABLE "<your_external_schema>"."<your_table_name>" (
    "int" bigint,
    "float" float,
    "str" varchar(255),
    "boolean" bool,
    "date" date,
    "timestamp" timestamp)
ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
    's3://<your_bucket>/<your_prefix>/';
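For completeness, here is a minimal sketch of the upload step assumed above; it uses boto3, and the bucket, prefix and file name are placeholders you would replace with your own:

import boto3

# Upload the locally written Parquet file to the S3 prefix the external table points at.
s3 = boto3.client('s3')
s3.upload_file(
    Filename='pyarrow.parquet',
    Bucket='<your_bucket>',
    Key='<your_prefix>/pyarrow.parquet'
)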
When I started working with Pandas, Parquet and external Redshift tables in the context of AWS Lambda functions, everything was fine for a while, until my Lambda bundle reached its allowed limit (deployment package size). Checking which of my dependencies accounted for all that, I found PyArrow, Pandas and NumPy (a dependency of Pandas) to be the culprits. While I definitely could not drop NumPy (for efficiency) and did not want to lose Pandas (convenience, again), I looked to replace PyArrow with something more lightweight. Et voilà: fastparquet. After some research and a lot of experimentation, I could make this work as well.
I hope some other people find this explanation and these resources helpful.