Reputation: 565
Is it possible to convert a pandas DataFrame to and from an ORC file? I can write the DataFrame to a Parquet file, but pandas doesn't seem to have ORC support. Is there an available solution in Python? If not, what would be the best strategy? One option could be converting the Parquet file to ORC using an external tool, but I have no clue where to find one.
Upvotes: 9
Views: 9919
Reputation: 6045
To add to the answer above, Pandas v1.5.0 natively supports writing to ORC files. I'll update this with more documentation when it's released.
my_df.to_orc('myfile.orc')
Upvotes: 5
Reputation: 63252
This answer is tested with pyarrow==4.0.1 and pandas==1.2.5. To write, it first creates a pyarrow table using pyarrow.Table.from_pandas, then writes the ORC file using pyarrow.orc.write_table. Reading goes through pd.read_orc.
# Read an ORC file into a DataFrame:
import pandas as pd
import pyarrow.orc  # This prevents: AttributeError: module 'pyarrow' has no attribute 'orc'
df = pd.read_orc('/tmp/your_df.orc')

# Write a DataFrame to an ORC file:
import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc
# Here prepare your pandas df.
table = pa.Table.from_pandas(df, preserve_index=False)
orc.write_table(table, '/tmp/your_df.orc')
As of pandas==1.3.0, there isn't a pd.to_orc writer yet.
Upvotes: 7
Reputation: 391
I have used pyarrow recently which has ORC support, although I've seen a few issues where the pyarrow.orc module is not being loaded.
pip install pyarrow
To use:
import pandas as pd
import pyarrow.orc as orc
# filename is the path to an existing ORC file; ORC is binary, so open it with 'rb'.
with open(filename, 'rb') as file:
    data = orc.ORCFile(file)
    df = data.read().to_pandas()
Upvotes: 0