Kermit

Reputation: 5992

pyarrow read parquet via column index or order?

Is there a workaround to selectively read parquet files via column index instead of column name?

Documentation shows reading via column name:

pq.read_table('example.parquet', columns=['one', 'three'])

What I'm looking for is something like:

pq.read_table('example.parquet', columns=[0, 2])

Similar question: Pandas Read/Write Parquet Data using Column Index


Update with attempt

This works, but it is redundant: the whole file is read just to get the column names, so I might as well drop the columns in memory with either pandas or numpy.

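import pyarrow.parquet as pq

desired_cols = [0, 2]

# Reads the entire file just to get the column names
pat = pq.read_table('file.parquet.gzip')
cols_names = pat.column_names
del pat

# Map the positions to names, then read the file a second time
desired_cols = [cols_names[c] for c in desired_cols]
pq.read_table('file.parquet.gzip', columns=desired_cols)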

"""
pyarrow.Table
anzsic06: string
year: int64
"""

Upvotes: 1

Views: 2602

Answers (1)

0x26res

Reputation: 13902

You can open the file with ParquetFile, which gives you the schema without loading the underlying data. From there you can look up the names of the columns you want by index and load only those columns:

import pyarrow.parquet as pq

# Load the metadata and look up column names by index:
pq_file = pq.ParquetFile('file.parquet')
column_indices = [1, 2]
column_names = [pq_file.schema[i].name for i in column_indices]

# Load the actual data:
pq.read_table('file.parquet', columns=column_names)

See http://arrow.apache.org/docs/python/parquet.html#inspecting-the-parquet-file-metadata

Upvotes: 2
