ps0604

Reputation: 1071

Dask from_array converts types to object

I have the following code that creates a Dask DataFrame from an array. The problem is that all the types are converted to object. I tried to specify the metadata but couldn't find a way. How do I specify meta in from_array?

b = np.array([(1.5, 2, 3, datetime(2000,1,1)), (4, 5, 6, datetime(2001, 2, 2))])
ddf = dd.from_array(b, columns=['col1', 'col2', 'col3', 'date1'], meta=['float', 'float', 'float', 'datetime'])

This throws AttributeError: 'list' object has no attribute '_constructor'

Upvotes: 1

Views: 555

Answers (2)

hpaulj

Reputation: 231375

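Look at your b array. Floats, ints, and datetime objects have no common numeric type, so NumPy stores every element as a Python object: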

In [61]: from datetime import datetime
In [62]: b = np.array([(1.5, 2, 3, datetime(2000,1,1)), (4, 5, 6, datetime(2001
    ...: , 2, 2))])
In [63]: b
Out[63]: 
array([[1.5, 2, 3, datetime.datetime(2000, 1, 1, 0, 0)],
       [4, 5, 6, datetime.datetime(2001, 2, 2, 0, 0)]], dtype=object)

Passing the rows to pandas as a list of tuples lets it infer a dtype per column:

In [93]: pd.DataFrame(b.tolist())
Out[93]: 
     0  1  2          3
0  1.5  2  3 2000-01-01
1  4.0  5  6 2001-02-02
In [94]: _.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   0       2 non-null      float64       
 1   1       2 non-null      int64         
 2   2       2 non-null      int64         
 3   3       2 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(2)
memory usage: 192.0 bytes

A structured array with an explicit dtype per field also keeps the column types:

In [95]: b1 = np.array([(1.5, 2, 3, np.datetime64(datetime(2000,1,1))), (4, 5,
    ...: 6, np.datetime64(datetime(2001, 2, 2)))], dtype=([('col1','float32'),(
    ...: 'col2','float32'), ('col3','float32'), ('date1','<M8[us]') ]))
In [96]: pd.DataFrame(b1)
Out[96]: 
   col1  col2  col3      date1
0   1.5   2.0   3.0 2000-01-01
1   4.0   5.0   6.0 2001-02-02
In [97]: _.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   col1    2 non-null      float32       
 1   col2    2 non-null      float32       
 2   col3    2 non-null      float32       
 3   date1   2 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float32(3)
memory usage: 168.0 bytes

Upvotes: 1

Alexandra Dudkina

Reputation: 4462

You could define the NumPy array as a structured array, so that each column carries its own dtype:

import numpy as np
import pandas as pd
import dask.dataframe as dd
from datetime import datetime

# Structured dtype: one (name, type) pair per column
dt = [('col1', 'float32'), ('col2', 'float32'), ('col3', 'float32'), ('date1', '<M8[us]')]
b = np.array([(1.5, 2, 3, np.datetime64(datetime(2000, 1, 1))),
              (4, 5, 6, np.datetime64(datetime(2001, 2, 2)))], dtype=dt)
ddf = dd.from_array(b, columns=['col1', 'col2', 'col3', 'date1'])

ddf.head()
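
If you already have the object array from the question and don't want to rebuild it by hand, it can probably be converted to the same structured layout first. A minimal sketch using only NumPy, assuming the field names and types from the answer above; the names dt and structured are just illustrative:

```python
from datetime import datetime

import numpy as np

# The question's array: mixed floats, ints and datetime objects give dtype=object
b = np.array([(1.5, 2, 3, datetime(2000, 1, 1)), (4, 5, 6, datetime(2001, 2, 2))])

# Target layout: one (name, type) pair per column
dt = np.dtype([("col1", "float32"), ("col2", "float32"),
               ("col3", "float32"), ("date1", "<M8[us]")])

# Structured arrays are built from a list of tuples, one tuple per row
structured = np.array([tuple(row) for row in b], dtype=dt)

print(structured.dtype.names)
print(structured["date1"])
```

The resulting array is one-dimensional with four named fields, and could then be passed to dd.from_array in the same way as b above.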

Upvotes: 3
