clstaudt

Reputation: 22438

Type error on first steps with Apache Parquet

I'm rather confused by running into this type error while trying out the Apache Parquet file format for the first time. Shouldn't Parquet support all the data types that pandas does? What am I missing?

import pandas
import pyarrow
import numpy

data = pandas.read_csv("data/BigData.csv", sep="|", encoding="iso-8859-1")
data_parquet = pyarrow.Table.from_pandas(data)

raises:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-9-90533507bcf2> in <module>()
----> 1 data_parquet = pyarrow.Table.from_pandas(data)

table.pxi in pyarrow.lib.Table.from_pandas()

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads)
    354             arrays = list(executor.map(convert_column,
    355                                        columns_to_convert,
--> 356                                        convert_types))
    357 
    358     types = [x.type for x in arrays]

~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result_iterator()
    584                     # Careful not to keep a reference to the popped future
    585                     if timeout is None:
--> 586                         yield fs.pop().result()
    587                     else:
    588                         yield fs.pop().result(end_time - time.time())

~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
    423                 raise CancelledError()
    424             elif self._state == FINISHED:
--> 425                 return self.__get_result()
    426 
    427             self._condition.wait(timeout)

~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\thread.py in run(self)
     54 
     55         try:
---> 56             result = self.fn(*self.args, **self.kwargs)
     57         except BaseException as exc:
     58             self.future.set_exception(exc)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in convert_column(col, ty)
    343 
    344     def convert_column(col, ty):
--> 345         return pa.array(col, from_pandas=True, type=ty)
    346 
    347     if nthreads == 1:

array.pxi in pyarrow.lib.array()

array.pxi in pyarrow.lib._ndarray_to_array()

error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Error converting from Python objects to Int64: Got Python object of type str but can only handle these types: integer

data.dtypes is:

0      object
1      object
2      object
3      object
4      object
5     float64
6     float64
7      object
8      object
9      object
10     object
11     object
12     object
13    float64
14     object
15    float64
16     object
17    float64
...

Upvotes: 1

Views: 6364

Answers (3)

Hari_pb

Reputation: 7416

I faced a similar situation. If possible, you can first convert all columns to the required data type and then convert to Parquet. For example:

import pandas as pd

# cast every column to str so that each column holds a single,
# homogeneous type before writing to Parquet
for col in df.columns:
    df[col] = df[col].astype(str)

df.to_parquet('df.parquet.gzip', compression='gzip')
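If the numeric columns should keep their types in the Parquet file, a gentler variant is to cast only the object-dtype columns to str. A sketch, using a hypothetical two-column frame in place of the real `df` (which would normally come from `read_csv`):

```python
import pandas as pd

# hypothetical frame: "a" is a mixed object column, "b" is clean float64
df = pd.DataFrame({"a": [1, "2", 3], "b": [1.5, 2.5, 3.5]})

# cast only the object (mixed-type) columns to str, leaving numeric
# columns untouched so their dtypes survive the Parquet round trip
obj_cols = df.select_dtypes(include="object").columns
df[obj_cols] = df[obj_cols].astype(str)
```

This way float64 columns like columns 5, 6, 13 in the question stay numeric instead of being stringified.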

Upvotes: 0

azymm

Reputation: 312

I had this same issue, and it took me a while to figure out a way to find the offending column. Here is what I came up with to find the mixed-type column, although I know there must be a more efficient way.

The last column printed before the exception is raised is the mixed-type column.

# method1: try saving the parquet file by removing 1 column at a time to 
# isolate the mixed type column.
cat_cols = df.select_dtypes('object').columns
for col in cat_cols:
    drop = set(cat_cols) - set([col])
    print(col)
    df.drop(drop, axis=1).reset_index(drop=True).to_parquet("c:/temp/df.pq")

Another attempt - list the columns and each type based on the unique values.

# method2: list all columns and the types within
def col_types(col):
    types = set([type(x) for x in col.unique()])
    return types

df.select_dtypes("object").apply(col_types, axis=0)
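Building on method 2, the same type census can be narrowed down to just the offenders: count the distinct Python types per object column and keep only the columns with more than one. A sketch with a hypothetical two-column frame (column names made up):

```python
import pandas as pd

# hypothetical frame: "clean" holds only strings, "mixed" holds int and str
df = pd.DataFrame({"clean": ["a", "b"], "mixed": [1, "x"]})

# for each object column, collect the distinct Python types among its
# values; columns with more than one type are the likely offenders
mixed = {
    col: df[col].map(type).unique().tolist()
    for col in df.select_dtypes("object").columns
    if df[col].map(type).nunique() > 1
}
```

Here `mixed` ends up containing only the `"mixed"` column, mapped to the types found in it.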

Upvotes: 1

Wes McKinney

Reputation: 105521

In Apache Arrow, table columns must be homogeneous in their data types. pandas supports Python object columns where values can be different types. So you will need to do some data scrubbing before writing to Parquet format.

We've handled some rudimentary cases (like both bytes and unicode in a column) in the Arrow-Python bindings but we don't hazard any guesses about how to handle bad data. I opened the JIRA https://issues.apache.org/jira/browse/ARROW-2098 about adding an option to coerce unexpected values to null in situations like this, which might help in the future.

Upvotes: 2
