Reputation: 143
I have a pandas dataframe. I am saving it to parquet using Spark and then trying to read it back via dask. The issue is that the partitioned column is not read back when using the pyarrow engine.
import datetime
import numpy as np
import pandas as pd
import dask.dataframe as dd
from pyspark.sql import SparkSession

df = pd.DataFrame({'i64': np.arange(1000, dtype=np.int64),
                   'Ii32': np.arange(1000, dtype=np.int32),
                   'f': np.arange(1000, dtype=np.float64),
                   't': [datetime.datetime.now()] * 1000,
                   'e': ['1'] * 998 + [None, '1'],
                   'g': [np.NAN] * 998 + [None, ''],
                   'bhello': np.random.choice(['hello', 'Yo', 'people', '1'], size=1000).astype("O")})
spark = SparkSession \
    .builder \
    .appName("Python Spark arrow compatibility") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
#enable metadata write from spark
spark.conf.set("parquet.enable.summary-metadata", "true")
#convert pandas df to spark df
sparkDf = spark.createDataFrame(df)
#write to parquet, partitioned by the 'bhello' column
path = 'hdfs://127.0.0.1:8020/tmp/test/outputParquet10'
sparkDf.write.parquet(path, partitionBy=['bhello'])
#use dask to read the parquet saved above, with the pyarrow engine
df2 = dd.read_parquet('hdfs://127.0.0.1:8020/tmp/test/outputParquet10',
                      engine='pyarrow')
print(df2.columns)
self.assertIn('bhello', df2.columns)
Any ideas what I am doing wrong here?
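For reference, this is a quick check I could run to see which columns pyarrow itself discovers for that directory (the plain /tmp path below is just a stand-in for the HDFS location above):

import pyarrow.parquet as pq

# Stand-in for the hdfs://127.0.0.1:8020/tmp/test/outputParquet10 directory above.
table = pq.read_table('/tmp/test/outputParquet10')

# With a hive-style layout (bhello=hello/, bhello=Yo/, ...) the partition
# column should be listed here alongside the regular columns.
print(table.column_names)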
Upvotes: 1
Views: 2506
Reputation: 693
I will assume that this is a minimal working example. My suggestion would be to read the data with dask and then write it back to parquet, using either the fastparquet or pyarrow engine.
Code is below.
import dask.dataframe as dd

ddf = dd.read_csv('/destination/of/your/file/file.format_name')
ddf.to_parquet('/destination/of/your/file/file.parquet', engine='fastparquet')  # default is fastparquet if both engines are installed
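If you want to confirm that the partition column survives, a rough sketch (same placeholder path as above) is to read the rewritten parquet back with each engine and compare the column lists:

import dask.dataframe as dd

# Read the rewritten parquet back with each engine and compare what they see.
for eng in ('fastparquet', 'pyarrow'):
    df_back = dd.read_parquet('/destination/of/your/file/file.parquet', engine=eng)
    print(eng, list(df_back.columns))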
Hope this helps.
Thanks
Michael
Upvotes: 2