Reputation: 595
I am using fastparquet to convert pandas DataFrames to Parquet files. It is much faster than my previous approach, which used PySpark.
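For reference, the write step looks roughly like this (a minimal sketch; the dataframe contents are just placeholders):

    import pandas as pd
    import fastparquet

    # example dataframe standing in for my real data
    df = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

    # write the dataframe to a parquet file with fastparquet
    fastparquet.write('/tmp/parquet/test.parquet', df)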
I then want to read these Parquet files with Spark, i.e.
sqlCtx.read.parquet('/tmp/parquet/test.parquet')
I had a few issues, which I managed to resolve. The issue I have now is with RLE encoding. I get the following Java exception when I try to read the Parquet file with PySpark:
Unsupported encoding: RLE
Is there a way to disable RLE when using the fastparquet write method?
Upvotes: 1
Views: 626
Reputation: 28683
This is an optimization within fastparquet for short integers ('int8', 'int16', 'uint8', 'uint16'). Unfortunately, Spark does not support the full Parquet spec.
If you want your data to be readable by Spark, you should first convert the short integer columns to 32 or 64 bits.
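A minimal sketch of that conversion, assuming a pandas dataframe df and the file path from the question (both placeholders for your own data):

    import fastparquet

    # upcast any 8/16-bit integer columns to 32 bits so fastparquet
    # does not choose the RLE encoding that Spark cannot read
    small_ints = df.select_dtypes(include=['int8', 'int16', 'uint8', 'uint16']).columns
    df[small_ints] = df[small_ints].astype('int32')

    fastparquet.write('/tmp/parquet/test.parquet', df)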
There has been some consideration of implementing a "compatibility mode" in which these problems go away at the cost of performance, but there are no concrete plans for that right now.
Upvotes: 1