thestephenstanton

Reputation: 695

Converting schemas via pandas vs pyarrow

I have a pandas DataFrame that I want to write out as a Parquet file using pyarrow.

I also need to be able to specify column types. If I change the type via pandas, I get no error, but when I change it via pyarrow, I get an error. See the examples:

Given

import pandas as pd
import pyarrow as pa

data = {"col": [86002575]}
df = pd.DataFrame(data)

Via Pandas

df = df.astype({"col": "float32"})

table = pa.Table.from_pandas(df)

No errors

Via PyArrow

schema = pa.Schema.from_pandas(df)
i = schema.get_field_index("col")
schema = schema.set(i, pa.field("col", pa.float32()))

table = pa.Table.from_pandas(df, schema=schema)

I get this error:

pyarrow.lib.ArrowInvalid: ('Integer value 86002575 not in range: -16777216 to 16777216', 'Conversion failed for column col with type int64')

I don't even recognize that range either. Is it trying to do some intermediary conversion when converting between the two?

Upvotes: 4

Views: 1845

Answers (1)

0x26res

Reputation: 13902

When converting from one type to another, arrow is much stricter than pandas.

In your case you are converting from int64 to float32. Because there are limits to how exactly floating point can represent whole numbers, arrow restricts the values you can convert to the range -16777216 to 16777216. Past that limit, float32 can no longer represent every integer, so if you were to convert the float value back to an int, you are not guaranteed to get the same value.

You can easily ignore these checks though:

schema_float32 = pa.schema([pa.field("col", pa.float32())])
table = pa.Table.from_pandas(df, schema=schema_float32, safe=False)

EDIT:

It's not documented explicitly in arrow. It's common software engineering knowledge.

From Wikipedia:

Any integer with absolute value less than 2^24 can be exactly represented in the single precision format, and any integer with absolute value less than 2^53 can be exactly represented in the double precision format. Furthermore, a wide range of powers of 2 times such a number can be represented. These properties are sometimes used for purely integer data, to get 53-bit integers on platforms that have double precision floats but only 32-bit integers.

2^24 = 16777216
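The limit can be checked without arrow at all. This is a small sketch using only the standard library; `to_float32` is a helper made up for this illustration that round-trips a number through IEEE 754 single precision:

```python
import struct


def to_float32(n: int) -> float:
    """Round-trip an integer through a 4-byte IEEE 754 float."""
    return struct.unpack("<f", struct.pack("<f", float(n)))[0]


limit = 2 ** 24  # 16777216

print(to_float32(limit) == limit)          # True: 2^24 itself is exact
print(to_float32(limit + 1) == limit + 1)  # False: 2^24 + 1 is not representable
print(int(to_float32(86002575)))           # 86002576, not the original value
```

The value from the question comes back as 86002576, which is exactly the kind of silent change the range check guards against.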

Since it's not very well documented in arrow, you can look at the code.

Upvotes: 2
