Reputation: 428
I have a dataframe with a structure like this:
Column1 Column2
0 (0.00030271668219938874, 0.0002655923890415579... (0.0016430083196610212, 0.0014970217598602176,...
1 (0.00015607803652528673, 0.0001314736582571640... (0.0022136708721518517, 0.0014974646037444472,...
2 (0.011317798867821693, 0.011339936405420303, 0... (0.004868391435593367, 0.004406007472425699, 0...
3 (3.94578673876822e-05, 3.075833956245333e-05, ... (0.0075020878575742245, 0.0096737677231431, 0....
4 (0.0004926157998852432, 0.0003811710048466921,... (0.010351942852139473, 0.008231297135353088, 0...
.. ... ...
130 (0.011190211400389671, 0.011337820440530777, 0... (0.010182800702750683, 0.011351295746862888, 0...
131 (0.006286659277975559, 0.007315031252801418, 0... (0.02104150503873825, 0.02531484328210354, 0.0...
132 (0.0022791570518165827, 0.0025983047671616077,... (0.008847278542816639, 0.009222050197422504, 0...
133 (0.0007059817435219884, 0.0009831463685259223,... (0.0028264704160392284, 0.0029402063228189945,...
134 (0.0018992726691067219, 0.002058899961411953, ... (0.0019639385864138603, 0.002009353833273053, ...
[135 rows x 2 columns]
where each cell holds a list/tuple of some float values:
type(psd_res.data_frame['Column1'][0])
<class 'tuple'>
type(psd_res.data_frame['Column1'][0][0])
<class 'numpy.float64'>
(Every cell's tuple contains the same number of entries.)
When I try to save the dataframe as Parquet, I get an error (fastparquet):
Can't infer object conversion type: 0 (0.00030271668219938874, 0.0002655923890415579...
1 (0.00015607803652528673, 0.0001314736582571640...
...
Name: Column1, dtype: object
Full stack trace: https://pastebin.com/8Myu8hNV
I also tried the other engine, pyarrow:
pyarrow.lib.ArrowInvalid: ('Could not convert (0.00030271668219938874, ..., 0.0002464042045176029)
with type tuple: did not recognize Python value type when inferring an Arrow data type',
'Conversion failed for column UO-Pumpe with type object')
So I found this thread: https://github.com/dask/fastparquet/issues/458. It seems to be a bug in fastparquet, but it should work with pyarrow, which fails for me too.
I then tried some things I found, like infer_objects()
and astype(float),
but nothing has worked so far.
Does anyone have a solution for saving my dataframe to Parquet?
Upvotes: 5
Views: 8468
Reputation: 13902
The cells of your dataframe contain tuples of floats, which is an unusual data type for a column.
So you need to give Arrow a little help to figure out the type of your data. To do so, provide the schema of your table explicitly.
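import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
{
"column1": [(1.0, 2.0), (3.0, 4.0, 5.0)]
}
)
# Declare the column as a list of float64 so Arrow does not have to infer it
schema = pa.schema([pa.field('column1', pa.list_(pa.float64()))])
# Extra keyword arguments such as schema are handed to the pyarrow engine
df.to_parquet('/tmp/hello.pq', engine='pyarrow', schema=schema)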
Note that if you were using lists of floats (instead of tuples) it would have worked:
df = pd.DataFrame(
{
"column1": [[1.0, 2.0], [3.0, 4.0, 5.0]]
}
)
df.to_parquet('/tmp/hello.pq')
Upvotes: 8