Reputation: 2389
I want to read a Parquet file and convert it to a pandas DataFrame so I can inspect the fields. I am new to the Parquet format and I get an error during the conversion to pandas. My code is the following:
import pyarrow as pa
import pyarrow.parquet as pq
parquet_file = pq.read_table('/biomedja01/disk1/software/GTEX/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_exon_reads.parquet')
parquet_file.to_pandas()
Here is a bit of the file metadata:
metadata = pq.read_metadata('/biomedja01/disk1/software/GTEX/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9$
print(metadata)
print(metadata.row_group(0))
print(metadata.row_group(0).column(0))
<pyarrow._parquet.FileMetaData object at 0x7f92fb146ef0>
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 17384
num_rows: 328671
num_row_groups: 1
format_version: 1.0
serialized_size: 4883225
<pyarrow._parquet.RowGroupMetaData object at 0x7f92fb100be0>
num_columns: 17384
num_rows: 328671
total_byte_size: 11453379595
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f931abfa150>
file_offset: 600791
file_path:
physical_type: BYTE_ARRAY
num_values: 328671
path_in_schema: Description
is_stats_set: True
statistics:
<pyarrow._parquet.RowGroupStatistics object at 0x7f931abfad80>
has_min_max: True
min: b'5S_rRNA'
max: b'yR211F11.2'
null_count: 0
distinct_count: 0
num_values: 328671
physical_type: BYTE_ARRAY
compression: SNAPPY
encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
has_dictionary_page: True
dictionary_page_offset: 4
data_page_offset: 389078
total_compressed_size: 600787
total_uncompressed_size: 1028503
The error I get when calling parquet_file.to_pandas() is the following:
Traceback (most recent call last):
File "file.py", line 4, in <module>
parquet_file.to_pandas()
File "pyarrow/table.pxi", line 1410, in pyarrow.lib.Table.to_pandas
File "/home/lingxu/.conda/envs/GTEX/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 618, in table_to_blockmanager
columns = _reconstruct_columns_from_metadata(columns, column_indexes)
File "/home/lingxu/.conda/envs/GTEX/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 735, in _reconstruct_columns_from_metadata
return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)
TypeError: __new__() got an unexpected keyword argument 'labels'
Upvotes: 0
Views: 1879
Reputation: 66
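It seems that your installed pandas and pyarrow versions are incompatible. The traceback shows pyarrow calling pd.MultiIndex with the labels keyword. pandas renamed that keyword to codes in 0.24 and removed it in 1.0, so your pyarrow is older than your pandas expects. You could try installing an older pandas; it looks like 0.25.3 should work. Upgrading pyarrow to a release that no longer passes labels would probably also fix it.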
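To confirm that this is the problem before changing anything, you could check which versions are installed in the environment. The following is only a minimal sketch: the helper names (parse_release, multiindex_accepts_labels, installed_version) are made up for illustration, it assumes Python 3.8+ for importlib.metadata, and the only fact it encodes is that pandas dropped the labels keyword in 1.0.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Return the leading numeric components, e.g. '0.25.3' -> (0, 25, 3)."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def multiindex_accepts_labels(pandas_version):
    """True if this pandas release still accepts pd.MultiIndex(labels=...).

    Assumption encoded here: the keyword was removed in pandas 1.0
    (it had been renamed to `codes` in 0.24).
    """
    return parse_release(pandas_version) < (1, 0)


def installed_version(package):
    """Return the installed version string of a package, or None if missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("pandas", "pyarrow"):
        print(name, installed_version(name))

    pandas_version = installed_version("pandas")
    if pandas_version is not None and not multiindex_accepts_labels(pandas_version):
        print("This pandas no longer accepts `labels`: "
              "pin pandas below 1.0 or upgrade pyarrow.")
```

If the check reports pandas 1.0 or newer, pinning the version suggested above (for example pip install "pandas==0.25.3", or conda install pandas=0.25.3 inside the GTEX environment) should make to_pandas() work with the pyarrow you already have.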
Upvotes: 2