Is there a workaround to selectively read parquet files via column index instead of column name?
Documentation shows reading via column name:
pq.read_table('example.parquet', columns=['one', 'three'])
What I'm looking for is something like:
pq.read_table('example.parquet', columns=[0, 2])
Similar question: Pandas Read/Write Parquet Data using Column Index
Update with attempt
This works, but it is redundant: the file is read twice, so I might as well drop columns in memory with pandas or numpy instead.
import pyarrow.parquet as pq

# First read: load the whole file just to learn the column names
desired_cols = [0, 2]
pat = pq.read_table('file.parquet.gzip')
col_names = pat.column_names
del pat  # free the table; we only needed its column names
# Second read: load only the desired columns
desired_cols = [col_names[c] for c in desired_cols]
pq.read_table('file.parquet.gzip', columns=desired_cols)
"""
pyarrow.Table
anzsic06: string
year: int64
"""
You can open the file as a ParquetFile, which gives you the schema without loading the underlying data. From there you can figure out the names of the columns you want based on their indices, and load only those columns:
import pyarrow.parquet as pq

# Load the metadata and resolve column names from indices:
pq_file = pq.ParquetFile('file.parquet')
column_indices = [1, 2]
column_names = [pq_file.schema[i].name for i in column_indices]
# Load the actual data:
pq.read_table('file.parquet', columns=column_names)
See http://arrow.apache.org/docs/python/parquet.html#inspecting-the-parquet-file-metadata