Reputation: 9156
I converted a sample dataframe to a .arrow file using pyarrow:
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [10, 2, 3]})
df['a'] = pd.to_numeric(df['a'], errors='coerce')

# Write the table in the Arrow IPC (random-access) file format
table = pa.Table.from_pandas(df)
writer = pa.RecordBatchFileWriter('test.arrow', table.schema)
writer.write_table(table)
writer.close()
This creates the file test.arrow. For reference, here is the dataframe's info:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
a 3 non-null int64
dtypes: int64(1)
memory usage: 104.0 bytes
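As a sanity check, the file can be read back with pyarrow itself before involving JS at all. A minimal sketch (not in the original post), using pa.ipc.open_file, which opens Arrow's random-access file format:
# Read all record batches back into a single Table
reader = pa.ipc.open_file('test.arrow')
roundtrip = reader.read_all()
print(roundtrip.num_rows)     # should print 3
print(roundtrip.to_pandas())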
Then, in Node.js, I load the file with ArrowJS (https://arrow.apache.org/docs/js/):
const fs = require('fs');
const arrow = require('apache-arrow');

// Read the whole file into a Buffer and parse it as an Arrow table
const data = fs.readFileSync('test.arrow');
const table = arrow.Table.from(data);

console.log(table.schema.fields.map(f => f.name));
console.log(table.count());
console.log(table.get(0));
This prints:
[ 'a' ]
0
null
I was expecting the table to have a length of 3 and table.get(0) to return the first row instead of null.
This is what the table schema looks like (console.log(table._schema)):
[ Int_ [Int] { isSigned: true, bitWidth: 16 } ]
Schema {
fields:
[ Field { name: 'a', type: [Int_], nullable: true, metadata: Map {} } ],
metadata:
Map {
'pandas' => '{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 5, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a", "pandas_type": "int16", "numpy_type": "int16", "metadata": null}], "creator": {"library": "pyarrow", "version": "0.15.0"}, "pandas_version": "0.22.0"}' },
dictionaries: Map {} }
Any idea why it is not getting the data as expected?
Upvotes: 5
Views: 1346
Reputation: 175
This is due to a format change in Arrow 0.15 (the IPC message encapsulation gained a continuation marker for 8-byte alignment), as mentioned by Wes on the Apache JIRA. This means that all Arrow libraries, not just PyArrow, will surface this issue when sending IPC files to older versions of Arrow. The fix is to upgrade ArrowJS to 0.15.0 so that you can round-trip between other Arrow libraries and the JS library; see the example just below. If you can't upgrade for some reason, you can instead use one of the workarounds that follow it:
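For example, assuming the JS side was installed via npm (the package manager is an assumption, not stated in the question):
$ npm install apache-arrow@0.15.0
After that, the snippet from the question should report a length of 3 and return the first row.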
Pass use_legacy_format=True as a kwarg to RecordBatchFileWriter:
with pa.RecordBatchFileWriter('file.arrow', table.schema, use_legacy_format=True) as writer:
writer.write_table(table)
Set the environment variable ARROW_PRE_0_15_IPC_FORMAT to 1:
$ export ARROW_PRE_0_15_IPC_FORMAT=1
$ python
>>> import pyarrow as pa
>>> table = pa.Table.from_pydict({"a": [1, 2, 3], "b": [4, 5, 6]})
>>> with pa.RecordBatchFileWriter('file.arrow', table.schema) as writer:
... writer.write_table(table)
...
Or downgrade PyArrow to 0.14.x:
$ conda install -c conda-forge pyarrow=0.14.1
Upvotes: 2