moth

Reputation: 2389

read parquet files and convert to pandas using pyarrow

I want to read a parquet file and convert it to a pandas DataFrame so that I can inspect the fields. I am new to the parquet format, and I get an error when converting to pandas. My code is the following:

import pyarrow as pa
import pyarrow.parquet as pq
parquet_file = pq.read_table('/biomedja01/disk1/software/GTEX/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_exon_reads.parquet')
parquet_file.to_pandas()

Here is a bit of the file metadata:

metadata = pq.read_metadata('/biomedja01/disk1/software/GTEX/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9$
print(metadata)
print(metadata.row_group(0))
print(metadata.row_group(0).column(0))


<pyarrow._parquet.FileMetaData object at 0x7f92fb146ef0>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 17384
  num_rows: 328671
  num_row_groups: 1
  format_version: 1.0
  serialized_size: 4883225
<pyarrow._parquet.RowGroupMetaData object at 0x7f92fb100be0>
  num_columns: 17384
  num_rows: 328671
  total_byte_size: 11453379595
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f931abfa150>
  file_offset: 600791
  file_path:
  physical_type: BYTE_ARRAY
  num_values: 328671
  path_in_schema: Description
  is_stats_set: True
  statistics:
    <pyarrow._parquet.RowGroupStatistics object at 0x7f931abfad80>
      has_min_max: True
      min: b'5S_rRNA'
      max: b'yR211F11.2'
      null_count: 0
      distinct_count: 0
      num_values: 328671
      physical_type: BYTE_ARRAY
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 389078
  total_compressed_size: 600787
  total_uncompressed_size: 1028503

The error I get when calling parquet_file.to_pandas() is the following:

Traceback (most recent call last):
  File "file.py", line 4, in <module>
    parquet_file.to_pandas()
  File "pyarrow/table.pxi", line 1410, in pyarrow.lib.Table.to_pandas
  File "/home/lingxu/.conda/envs/GTEX/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 618, in table_to_blockmanager
    columns = _reconstruct_columns_from_metadata(columns, column_indexes)
  File "/home/lingxu/.conda/envs/GTEX/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 735, in _reconstruct_columns_from_metadata
    return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)
TypeError: __new__() got an unexpected keyword argument 'labels'

Upvotes: 0

Views: 1879

Answers (1)

Mathijs

Reputation: 66

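It seems that you have an incompatible version of pandas installed. The traceback shows pyarrow calling pd.MultiIndex(..., labels=...), and pandas 1.0 removed the labels keyword (it was renamed to codes in 0.24), so an older pyarrow paired with pandas 1.0 or later fails with exactly this TypeError. You could try installing an older pandas; it looks like 0.25.3 should work. Upgrading pyarrow to a release that passes codes instead should also resolve it.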
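
To see which side of the incompatibility an environment is on, here is a minimal sketch. It assumes the labels keyword of pd.MultiIndex was dropped in pandas 1.0, which is what the TypeError in the traceback points to. The helper names (major_version, multiindex_accepts_labels, installed_version) are hypothetical and not part of pandas or pyarrow.

```python
import importlib


def major_version(version):
    """Return the leading major number of a version string such as '0.25.3'."""
    return int(version.split(".")[0])


def multiindex_accepts_labels(pandas_version):
    """Assumption: pandas releases before 1.0 still accept MultiIndex(labels=...)."""
    return major_version(pandas_version) < 1


def installed_version(package):
    """Return the package's __version__ string, or None if it cannot be imported."""
    try:
        return importlib.import_module(package).__version__
    except ImportError:
        return None


if __name__ == "__main__":
    # Report the installed versions of the two packages involved in the traceback.
    for name in ("pandas", "pyarrow"):
        print(name, installed_version(name))

    pandas_version = installed_version("pandas")
    if pandas_version is not None and not multiindex_accepts_labels(pandas_version):
        print("This pandas no longer accepts 'labels'; "
              "downgrade pandas (e.g. 0.25.3) or upgrade pyarrow.")
```

Running this in the same conda environment as the failing script shows whether the installed pandas is 1.0 or newer, which is the case where an older pyarrow would hit the error above.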

Upvotes: 2
