Shiva Sankar

Reputation: 1

Pandas table to Pyarrow conversion not working for string to int

I have an xlsx file with 15,000 records. I'm trying to serialize the data for an API service: read the file and send it in an HTTP response.

The input data looks like this:

account_name | dr_code | cr_code | amount | rate | category
A            | 12582   | 12582   | 5000   | 30   | POP
B            | 55AG98  | 55AG98  | 2000   | 40   | POP
C            | 5ER0AB  |         | 5000   | 2.2  | POP

My code is below:

import pandas
import pyarrow
df = pandas.read_excel("file.xlsx", usecols=[0, 1, 4, 6, 7, 8])
b = pyarrow.Table.from_pandas(df, preserve_index=True)

I get this error:

pyarrow.lib.ArrowInvalid : ("Could not convert "55AG98" with type str" : tried to convert to int, 'conversion failed for the column dr_code with type object')

The code above works when every value in a column has the same datatype, but it fails when a column mixes datatypes.

Upvotes: 0

Views: 2419

Answers (1)

Nick ODell

Reputation: 25220

If you don't want pyarrow to guess what types the result should have, you need to pass a schema when doing this conversion.

For example:

import pandas
import pyarrow

# Read only the six columns of interest from the workbook.
df = pandas.read_excel("file.xlsx", usecols=[0, 1, 4, 6, 7, 8])

# Declare the Arrow type of every column instead of letting pyarrow infer it.
schema = pyarrow.schema([
    ('account_name', pyarrow.string()),
    ('dr_code', pyarrow.string()),
    ('cr_code', pyarrow.string()),
    ('amount', pyarrow.float64()),
    ('rate', pyarrow.float64()),
    ('category', pyarrow.string()),
])
b = pyarrow.Table.from_pandas(df, schema=schema, preserve_index=True)

Upvotes: 1
