Reputation: 4818
Using pyarrow 0.6.0 (or earlier), the following snippet causes the Python interpreter to crash:
import pandas as pd
import pyarrow as pa
data = pd.DataFrame({'a': [1, True]})
pa.Table.from_pandas(data)
On Windows, the crash is reported as "The Python interpreter has stopped working".
Upvotes: 1
Views: 4661
Reputation: 4818
After some investigation, the issue turns out to be fixed in pyarrow 0.7.0, according to this Jira issue and, more precisely, this commit. With the same snippet as in the question, the interpreter no longer crashes and we get the following error instead:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "table.pxi", line 755, in pyarrow.lib.Table.from_pandas
  File "C:\Temp\tt\Tools\Anaconda3.4.3.1\envs\GMF_test3\lib\site-packages\pyarrow\pandas_compat.py", line 227, in dataframe_to_arrays
    col, type=type, timestamps_to_ms=timestamps_to_ms
  File "array.pxi", line 225, in pyarrow.lib.Array.from_pandas
  File "error.pxi", line 77, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Error converting from Python objects to Int64: Got Python object of type bool but can only handle these types: integer
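To see which columns would trigger this error before handing the frame to pyarrow, you can compare the exact Python types of the elements. The following is only a minimal sketch: `element_types` and `mixed_columns` are hypothetical helper names, not part of pandas or pyarrow, and the input is assumed to be any iterable of (name, values) pairs such as `df.items()`.

```python
def element_types(values):
    """Return the set of exact Python type names found among the values."""
    return {type(v).__name__ for v in values}


def mixed_columns(columns):
    """Return {column name: type names} for columns holding more than one type.

    `columns` is any iterable of (name, values) pairs, e.g. `df.items()`.
    """
    mixed = {}
    for name, values in columns:
        kinds = element_types(values)
        if len(kinds) > 1:
            mixed[name] = kinds
    return mixed


# bool is a subclass of int in Python, so isinstance(True, int) is True;
# comparing exact types with type() is what tells the two apart.
sample = {"a": [1, True], "b": [1, 2]}
print(mixed_columns(sample.items()))
```

With the frame from the question, `mixed_columns(data.items())` should report column `a` only, since it is the one holding both an `int` and a `bool`.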
One possible workaround, if you control your data, is to convert the column with mixed dtypes when the exception occurs, as in the following (and probably log the exception, since this is not a common mistake):
import pandas as pd
import pyarrow as pa
import logging

logger = logging.getLogger(__name__)

data = pd.DataFrame({'a': [1, True], 'b': [1, 2]})


def convert_type_if_needed(type_to_select, df, col_name):
    # Collect the Python type of every element in the column.
    types = []
    for i in df[col_name]:
        types.append(type(i))
    # Only cast if the requested type is actually present in the column.
    if type_to_select in types:
        return df.astype({col_name: type_to_select})
    else:
        raise TypeError(str(type_to_select) + " is not in the dataframe, conversion impossible")


try:
    table = pa.Table.from_pandas(data)
except pa.lib.ArrowInvalid as e:
    # Log the failure, cast the mixed column to int, then retry.
    logger.warning(e)
    data = convert_type_if_needed(int, data, 'a')
    table = pa.Table.from_pandas(data)
print(table)
This finally yields the following output (the second line is the logged warning):
pyarrow.Table
Error converting from Python objects to Int64: Got Python object of type bool but can only handle these types: integer
a: int32
b: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"columns": [{"name": "a", "numpy_type": "int32", "pandas_type":'
b' "int32", "metadata": null}, {"name": "b", "numpy_type": "int64"'
b', "pandas_type": "int64", "metadata": null}, {"name": "__index_l'
b'evel_0__", "numpy_type": "int64", "pandas_type": "int64", "metad'
b'ata": null}], "index_columns": ["__index_level_0__"], "pandas_ve'
b'rsion": "0.20.3"}'}
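If you would rather normalise the column up front than catch the exception and retry, a stricter variant is to coerce only integer-like values and reject everything else. This is a sketch under assumptions: `coerce_bools_to_int` is a hypothetical helper, and it relies on `bool` (like `int`) counting as `numbers.Integral`.

```python
import numbers


def coerce_bools_to_int(values):
    """Return a list of plain ints; bools become 0/1, other types are rejected."""
    result = []
    for v in values:
        # bool is a subclass of int, so it also passes the Integral check.
        if isinstance(v, numbers.Integral):
            result.append(int(v))
        else:
            raise TypeError(
                "cannot coerce value of type " + type(v).__name__ + " to int"
            )
    return result


print(coerce_bools_to_int([1, True]))
```

You could then assign the result back, e.g. `data['a'] = coerce_bools_to_int(data['a'])`, before calling `pa.Table.from_pandas(data)`. Unlike a blanket `astype(int)`, this raises on strings or floats instead of silently converting them.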
Upvotes: 3