Sarath

Reputation: 9156

Converted apache arrow file from data frame gives null while reading with arrow.js

I converted a sample dataframe to a .arrow file using pyarrow:

import numpy as np
import pandas as pd
import pyarrow as pa

# Build a one-column dataframe and make sure the column is numeric
df = pd.DataFrame({"a": [10, 2, 3]})
df['a'] = pd.to_numeric(df['a'], errors='coerce')

# Convert to an Arrow table and write it out in the Arrow IPC file format
table = pa.Table.from_pandas(df)
writer = pa.RecordBatchFileWriter('test.arrow', table.schema)
writer.write_table(table)
writer.close()

This creates a file test.arrow

df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 1 columns):
    a    3 non-null int64
    dtypes: int64(1)
    memory usage: 104.0 bytes
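As a sanity check, the file can be read back with PyArrow itself; a minimal sketch (pa.ipc.open_file is the reading counterpart of RecordBatchFileWriter):

import pyarrow as pa

# Open the IPC file and materialise all record batches as a Table
reader = pa.ipc.open_file('test.arrow')
table_back = reader.read_all()
print(table_back.num_rows)     # 3
print(table_back.to_pandas())  # the original dataframe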

Then in Node.js I load the file with ArrowJS (https://arrow.apache.org/docs/js/):

const fs = require('fs');
const arrow = require('apache-arrow');

// Read the IPC file and parse it into an Arrow Table
const data = fs.readFileSync('test.arrow');
const table = arrow.Table.from(data);

console.log(table.schema.fields.map(f => f.name)); // column names
console.log(table.count());                        // number of rows
console.log(table.get(0));                         // first row

This prints:

[ 'a' ]
0
null

I was expecting the table to have a length of 3 and table.get(0) to return the first row instead of null.

This is what the table schema looks like (console.log(table._schema)):

[ Int_ [Int] { isSigned: true, bitWidth: 16 } ]
Schema {
  fields:
   [ Field { name: 'a', type: [Int_], nullable: true, metadata: Map {} } ],
  metadata:
   Map {
     'pandas' => '{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 5, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a", "pandas_type": "int16", "numpy_type": "int16", "metadata": null}], "creator": {"library": "pyarrow", "version": "0.15.0"}, "pandas_version": "0.22.0"}' },
  dictionaries: Map {} }

Any idea why it is not getting the data as expected?

Upvotes: 5

Views: 1346

Answers (1)

Joe Quigley

Reputation: 175

This is due to a format change in Arrow 0.15, as mentioned by Wes on the Apache JIRA. This means that all Arrow libraries, not just PyArrow, will surface this issue when sending IPC files to older versions of Arrow. The fix is to upgrade ArrowJS to 0.15.0, so that you can round-trip between other Arrow libraries and the JS library. If you can't update for some reason, you can instead use one of the workarounds below:

Pass use_legacy_format=True as a kwarg to RecordBatchFileWriter:

with pa.RecordBatchFileWriter('file.arrow', table.schema, use_legacy_format=True) as writer:
    writer.write_table(table)

Set the environment variable ARROW_PRE_0_15_IPC_FORMAT to 1:

$ export ARROW_PRE_0_15_IPC_FORMAT=1
$ python
>>> import pyarrow as pa
>>> table = pa.Table.from_pydict( {"a": [1, 2, 3], "b": [4, 5, 6]} )
>>> with pa.RecordBatchFileWriter('file.arrow', table.schema) as writer:
...   writer.write_table(table)
...

Or downgrade PyArrow to 0.14.x:

$ conda install -c conda-forge pyarrow=0.14.1
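If you're not sure which PyArrow version is doing the writing, a quick check (the pandas metadata in the schema dump above also records it as 0.15.0):

import pyarrow as pa

# 0.15.0 or newer writes the new IPC format; older ArrowJS readers will return null/empty data
print(pa.__version__)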

Upvotes: 2
