Reputation: 695
I have a pandas DataFrame that I want to write out as a Parquet file using pyarrow.
I also need to be able to specify column types. If I change the type via pandas, I get no error; but when I change it via pyarrow, I get an error. See examples:
import pandas as pd
import pyarrow as pa
data = {"col": [86002575]}
df = pd.DataFrame(data)
df = df.astype({"col": "float32"})
table = pa.Table.from_pandas(df)
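This raises no error. Setting the type through a pyarrow schema instead (with the column still int64, i.e. without the astype call):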
schema = pa.Schema.from_pandas(df)
i = schema.get_field_index("col")
schema = schema.set(i, pa.field("col", pa.float32()))
table = pa.Table.from_pandas(df, schema=schema)
I get this error:
pyarrow.lib.ArrowInvalid: ('Integer value 86002575 not in range: -16777216 to 16777216', 'Conversion failed for column col with type int64')
I don't even recognize that range either. Is it trying to do some intermediary conversion when converting between the two?
Upvotes: 4
Views: 1845
Reputation: 13902
When converting from one type to another, arrow is much stricter than pandas.
In your case you are converting from int64 to float32. Because there are limits to the exact representation of whole numbers in floating point, arrow restricts the values you can convert to the range -16777216 to 16777216 (that is, ±2^24). Past that limit, float32 can no longer represent every integer, so if you converted the float value back to an int, you would not be guaranteed to get the same value.
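To make the precision loss concrete, here is a minimal sketch using only the standard library. The helper name `to_float32` is made up for illustration; it round-trips a number through a 4-byte IEEE 754 float, which is the storage format float32 uses.

```python
import struct


def to_float32(value: int) -> float:
    """Round-trip a value through a 4-byte IEEE 754 float (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


original = 86002575          # the value from the question
stored = to_float32(original)

print(int(stored))               # 86002576
print(int(stored) == original)   # False: the integer did not survive the conversion
```

At this magnitude float32 values are spaced 8 apart, so 86002575 is rounded to the nearest representable value, 86002576. That silent change is what the strict check is guarding against.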
You can easily ignore these checks though:
schema_float32 = pa.schema([pa.field("col", pa.float32())])
# safe=False skips the range check, so out-of-range integers are rounded instead of raising
table = pa.Table.from_pandas(df, schema=schema_float32, safe=False)
EDIT:
This limit is not explicitly documented in arrow. It follows from common knowledge about how floating-point numbers work:
Any integer with absolute value less than 2^24 can be exactly represented in the single precision format, and any integer with absolute value less than 2^53 can be exactly represented in the double precision format. Furthermore, a wide range of powers of 2 times such a number can be represented. These properties are sometimes used for purely integer data, to get 53-bit integers on platforms that have double precision floats but only 32-bit integers.
2^24 = 16777216
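The two boundaries mentioned above can be checked with a short standard-library sketch. The helper `float32_roundtrip` is a hypothetical name; Python's built-in `float` is a double, so it covers the 2^53 case directly.

```python
import struct


def float32_roundtrip(n: int) -> int:
    """Convert an int to single precision and back (hypothetical helper)."""
    return int(struct.unpack("f", struct.pack("f", float(n)))[0])


SINGLE_LIMIT = 2**24   # 16777216
DOUBLE_LIMIT = 2**53   # 9007199254740992

# Single precision: the limit itself survives, the next integer does not.
single_at_limit = float32_roundtrip(SINGLE_LIMIT) == SINGLE_LIMIT
single_past_limit = float32_roundtrip(SINGLE_LIMIT + 1) == SINGLE_LIMIT + 1

# Double precision: same pattern at 2**53.
double_at_limit = int(float(DOUBLE_LIMIT)) == DOUBLE_LIMIT
double_past_limit = int(float(DOUBLE_LIMIT + 1)) == DOUBLE_LIMIT + 1

print(single_at_limit, single_past_limit)  # True False
print(double_at_limit, double_past_limit)  # True False
```

Some larger integers (for example, multiples of a suitable power of 2) are still exactly representable, but beyond these limits not every integer is, which is why a range check is a reasonable conservative rule.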
Since arrow's documentation does not cover this well, the best reference is the source code.
Upvotes: 2