alcor

Reputation: 565

Convert Pandas dataframe from/to ORC file

Is it possible to convert a Pandas dataframe to and from an ORC file? I can write the df to a Parquet file, but pandas doesn't seem to have ORC support. Is there an available solution in Python? If not, what would be the best strategy? One option could be converting the Parquet file to ORC with an external tool, but I have no clue where to find one.

Upvotes: 9

Views: 9919

Answers (3)

Gabe

Reputation: 6045

To add to the answer above, pandas v1.5.0 natively supports writing ORC files through DataFrame.to_orc, which uses pyarrow as its engine. I'll update this with more documentation when it's released.

my_df.to_orc('myfile.orc')

Upvotes: 5

Asclepius

Reputation: 63252

This answer is tested with pyarrow==4.0.1 and pandas==1.2.5.

It first creates a pyarrow table using pyarrow.Table.from_pandas. It then writes the ORC file using pyarrow.orc.write_table.

Read orc

import pandas as pd
import pyarrow.orc  # This prevents: AttributeError: module 'pyarrow' has no attribute 'orc'

df = pd.read_orc('/tmp/your_df.orc')

Write orc

import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc

# Here prepare your pandas df.

table = pa.Table.from_pandas(df, preserve_index=False)
orc.write_table(table, '/tmp/your_df.orc')

As of pandas==1.3.0, there isn't a DataFrame.to_orc writer yet, which is why the write goes through pyarrow directly.

Upvotes: 7

PHY6

Reputation: 391

I recently used pyarrow, which has ORC support, although I've seen a few issues where the pyarrow.orc module is not loaded unless it is imported explicitly.

pip install pyarrow

To use it:

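import pandas as pd
import pyarrow.orc as orc  # import the submodule explicitly

filename = '/tmp/your_df.orc'  # placeholder: path to an existing ORC file

# ORC is a binary format, so open the file in binary mode.
with open(filename, 'rb') as file:
    data = orc.ORCFile(file)
    df = data.read().to_pandas()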

Upvotes: 0
