Reputation: 3048
I want to read a folder full of Parquet files containing pandas DataFrames. In addition to the data itself, I want to store the name of the file each row was read from in a column "file_origin". In pandas I am able to do it like this:
import pandas as pd
from pathlib import Path
data_dir = Path("path_of_folder_with_files")
df = pd.concat(
    pd.read_parquet(parquet_file).assign(file_origin=parquet_file.name)
    for parquet_file in data_dir.glob("*")
)
Unfortunately this is quite slow. Is there a similar way to do this with pyarrow (or any other efficient package)? I can read the whole folder in a single call, but then I don't see how to attach the file names:
import pyarrow.parquet as pq
table = pq.read_table(data_dir, use_threads=True)
df = table.to_pandas()
Upvotes: 1
Views: 614
Reputation: 13902
You could implement it with pyarrow directly instead of going through pandas:
import pyarrow as pa
import pyarrow.parquet as pq

def read_with_origin(data_dir):
    batches = []
    for file_name in data_dir.glob("*"):
        table = pq.read_table(file_name)
        # pa.array() needs strings rather than Path objects, so convert explicitly;
        # .name matches the file_origin column from your pandas version
        origin = pa.array([file_name.name] * len(table), pa.string())
        table = table.append_column("file_origin", origin)
        batches.extend(table.to_batches())
    return pa.Table.from_batches(batches)
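To get back to a DataFrame, call the function (read_with_origin is just the name used in the sketch above) and convert the resulting table:

df = read_with_origin(data_dir).to_pandas()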
I don't expect it to be significantly faster, unless your table contains a lot of strings and objects (which pandas handles slowly).
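Another option, if your pyarrow version ships the dataset API, is to read each file as a dataset fragment: fragment.path gives you the path of the underlying file, so you don't need to glob yourself. A minimal sketch, assuming pyarrow >= 3 and the same data_dir as in the question:

import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset(data_dir, format="parquet")
tables = []
for fragment in dataset.get_fragments():
    table = fragment.to_table()  # reads one Parquet file
    # fragment.path is the full path of the file backing this fragment
    origin = pa.array([fragment.path] * len(table), pa.string())
    tables.append(table.append_column("file_origin", origin))
df = pa.concat_tables(tables).to_pandas()

The per-file work is the same as in the loop above, so I wouldn't expect a big speed difference either; the dataset API mainly saves you the manual globbing and handles nested directories for free.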
Upvotes: 1