Reputation: 1071
I have the following code that creates a Dask DataFrame from a NumPy array. The problem is that all the types are converted to object. I tried to specify the metadata but couldn't find a way. How do I specify meta in from_array?
b = np.array([(1.5, 2, 3, datetime(2000,1,1)), (4, 5, 6, datetime(2001, 2, 2))])
ddf = dd.from_array(b, columns=['col1', 'col2', 'col3', 'date1'], meta=['float', 'float', 'float', 'datetime'])
This throws AttributeError: 'list' object has no attribute '_constructor'
Upvotes: 1
Views: 555
Reputation: 231375
Look at your b array. Because each tuple mixes floats, ints, and datetime objects, NumPy stores it with dtype=object:
In [61]: from datetime import datetime
In [62]: b = np.array([(1.5, 2, 3, datetime(2000,1,1)), (4, 5, 6, datetime(2001, 2, 2))])
In [63]: b
Out[63]:
array([[1.5, 2, 3, datetime.datetime(2000, 1, 1, 0, 0)],
       [4, 5, 6, datetime.datetime(2001, 2, 2, 0, 0)]], dtype=object)
In [93]: pd.DataFrame(b.tolist())
Out[93]:
     0  1  2          3
0  1.5  2  3 2000-01-01
1  4.0  5  6 2001-02-02
In [94]: _.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   0       2 non-null      float64
 1   1       2 non-null      int64
 2   2       2 non-null      int64
 3   3       2 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 192.0 bytes
In [95]: b1 = np.array([(1.5, 2, 3, np.datetime64(datetime(2000,1,1))), (4, 5, 6, np.datetime64(datetime(2001, 2, 2)))], dtype=([('col1','float32'),('col2','float32'), ('col3','float32'), ('date1','<M8[us]') ]))
In [96]: pd.DataFrame(b1)
Out[96]:
   col1  col2  col3      date1
0   1.5   2.0   3.0 2000-01-01
1   4.0   5.0   6.0 2001-02-02
In [97]: _.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   col1    2 non-null      float32
 1   col2    2 non-null      float32
 2   col3    2 non-null      float32
 3   date1   2 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float32(3)
memory usage: 168.0 bytes
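The same contrast can be checked with NumPy alone, before pandas or dask get involved. This is only a minimal sketch: the names rows, plain, record_dtype, and structured are mine, not from the question, and the field names and dtypes are copied from b1 above.

```python
from datetime import datetime

import numpy as np

# Same data as in the question.
rows = [
    (1.5, 2, 3, datetime(2000, 1, 1)),
    (4, 5, 6, datetime(2001, 2, 2)),
]

# Without an explicit dtype the mixed values end up in a 2-D object array,
# so the per-column type information is lost.
plain = np.array(rows)
print(plain.dtype, plain.shape)

# A structured dtype gives every field its own name and type.
record_dtype = np.dtype([
    ("col1", "float32"),
    ("col2", "float32"),
    ("col3", "float32"),
    ("date1", "datetime64[us]"),
])

# Wrap the datetimes in np.datetime64, as in b1 above, then build a
# 1-D array of records.
structured = np.array(
    [(a, b, c, np.datetime64(d)) for a, b, c, d in rows],
    dtype=record_dtype,
)
print(structured.dtype.names, structured.shape)
print(structured["col1"].dtype, structured["date1"].dtype)
```

Note that the structured array is one-dimensional (one record per row), whereas the object array is 2 x 4. The column names come from the dtype itself.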
Upvotes: 1
Reputation: 4462
You could define the NumPy array as a structured array, so that each column carries its own dtype:
import numpy as np
import pandas as pd
import dask.dataframe as dd
from datetime import datetime
b = np.array(
    [(1.5, 2, 3, np.datetime64(datetime(2000, 1, 1))),
     (4, 5, 6, np.datetime64(datetime(2001, 2, 2)))],
    dtype=[('col1', 'float32'), ('col2', 'float32'),
           ('col3', 'float32'), ('date1', '<M8[us]')],
)
ddf = dd.from_array(b, columns=['col1', 'col2', 'col3', 'date1'])
ddf.head()
Upvotes: 3