Reputation: 6429
I am using pandas/dask for computations and storing my data in a Parquet file on disk. The problem is that I have a column 'time' and also an index named 'time', and I want to keep both. When I store the data and load it again later, I get the following error:
import pyarrow as pa
import pyarrow.parquet as pq
%matplotlib inline
dfx.to_dict()
Out[115]:
{'close': {Timestamp('2017-06-30 01:31:00'): 154.99958999999998,
Timestamp('2017-06-30 01:32:00'): 154.99958999999998,
Timestamp('2017-06-30 01:33:00'): 154.01109,
Timestamp('2017-06-30 01:34:00'): 154.01109,
Timestamp('2017-06-30 01:35:00'): 152.60051000000001},
'time': {Timestamp('2017-06-30 01:31:00'): Timestamp('2017-06-30 01:31:00'),
Timestamp('2017-06-30 01:32:00'): Timestamp('2017-06-30 01:32:00'),
Timestamp('2017-06-30 01:33:00'): Timestamp('2017-06-30 01:33:00'),
Timestamp('2017-06-30 01:34:00'): Timestamp('2017-06-30 01:34:00'),
Timestamp('2017-06-30 01:35:00'): Timestamp('2017-06-30 01:35:00')}}
# set index column
dfx.set_index('time', drop=False, inplace=True)
dfx.head()
Out[117]:
                                   time      close
time
2017-06-30 01:31:00 2017-06-30 01:31:00  154.99959
2017-06-30 01:32:00 2017-06-30 01:32:00  154.99959
2017-06-30 01:33:00 2017-06-30 01:33:00  154.01109
2017-06-30 01:34:00 2017-06-30 01:34:00  154.01109
2017-06-30 01:35:00 2017-06-30 01:35:00  152.60051
# store to parquet file format
tdfx = pa.Table.from_pandas(dfx)
pq.write_table(tdfx, 'data.parquet' )
# recovering
dfx = pq.read_table('data.parquet').to_pandas()
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-119-5e9d7cd2ea0d> in <module>()
1 # recovering
----> 2 dfx = pq.read_table('data.parquet').to_pandas()
pyarrow/table.pxi in pyarrow.lib.Table.to_pandas (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:37990)()
/home/ghildebrand/anaconda3/envs/p36/lib/python3.6/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, memory_pool, nthreads)
296 i = schema.get_field_index(name)
297 if i != -1:
--> 298 col = table.column(i)
299 index_name = (None if is_unnamed_index_level(name)
300 else name)
pyarrow/table.pxi in pyarrow.lib.Table.column (/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:38622)()
IndexError: Table column index 2 is out of range
Is this a bug in pyarrow, is this not possible with Parquet, or am I doing something else wrong?
Update: removing the redundant column "time" and keeping only the index solves the problem. So I guess the issue is that, somewhere in the Parquet path, column identifiers are collected into a unique set, or something similar.
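For anyone hitting the same error, here is a minimal sketch of the workaround from the update, using pandas only. The helper names `drop_index_duplicate` and `restore_index_column` are hypothetical (they are not part of pandas or pyarrow), and the sample frame just mirrors the five rows shown above. The idea is to drop the column that duplicates the index before writing, and to rebuild it from the index after reading.

```python
import pandas as pd


def drop_index_duplicate(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without the column that shares the index's name."""
    name = df.index.name
    if name is not None and name in df.columns:
        return df.drop(columns=[name])
    return df.copy()


def restore_index_column(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the index values repeated as the first column."""
    name = df.index.name
    if name is None:
        raise ValueError("index has no name to restore a column from")
    out = df.copy()
    if name not in out.columns:
        # An Index is inserted positionally, so no alignment takes place.
        out.insert(0, name, df.index)
    return out


# Sample data mirroring the question (values rounded as in dfx.head()).
times = pd.date_range("2017-06-30 01:31:00", periods=5, freq="min")
dfx = pd.DataFrame(
    {"time": times,
     "close": [154.99959, 154.99959, 154.01109, 154.01109, 152.60051]}
)
dfx.set_index("time", drop=False, inplace=True)

# Frame with only the 'time' index and the 'close' column; this is what
# would be passed to pa.Table.from_pandas / pq.write_table.
to_store = drop_index_duplicate(dfx)

# After reading the file back with to_pandas(), rebuild the column.
restored = restore_index_column(to_store)
```

The Parquet write and read from the question would sit between the two helper calls. This only sidesteps the name clash; it does not explain or fix whatever pyarrow is doing internally.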
Upvotes: 1
Views: 2995
Reputation: 105501
This looks like a bug to me. I opened a bug report at https://issues.apache.org/jira/browse/ARROW-1754 ; let's continue the discussion there.
Upvotes: 2