colinfang

Reputation: 21707

How can I force a pandas DataFrame to use the desired dtypes when it is constructed?

For example:

raw = {'x':[1,2,3,4], 'y':[None,]*4, 'z':[datetime.now()] *4, 'e':[1,2,3,4]}

a = pd.DataFrame(raw, dtype={'x':float, 'y':float, 'z':object, 'e':int})

This doesn't work.

Currently I have to do:

a = pd.DataFrame(raw, dtype=object)
a['x'] = a['x'].astype(float)
a['y'] = a['y'].astype(float)
a['z'] = pd.to_datetime(a['z'], utc=True)
a['e'] = a['e'].astype(int)

Since I have many raw objects to convert into DataFrames, is there an easy way to force the right dtypes at construction time, rather than casting them afterwards, which roughly doubles the time needed?

@Jeff has a good way to deal with raw if it is in dict format.

But what if raw is in records format, like:

raw = [(1,None,datetime.now(),1),
       (2,None,datetime.now(),2), 
       (3,None,datetime.now(),3),
       (4,None,datetime.now(),4)]

Do I have to zip it? Perhaps zipping would take longer than casting again afterwards. DataFrame.from_records doesn't seem to accept a dtype parameter at all.

Upvotes: 3

Views: 3028

Answers (1)

Jeff

Reputation: 128918

The constructor will infer unambiguous types correctly. You cannot currently specify a compound dtype mapping; the issue is here, and pull requests to implement this are welcome.

  • Don't use None; use np.nan instead (otherwise the column will be inferred as object dtype)
  • Specify floats with a decimal point (or wrap the values in a Series, e.g. Series([1,2,3,4],dtype='float'))
  • datetimes will automatically be inferred as datetime64[ns], which is almost always what you want unless you need to specify a timezone
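As a minimal sketch of the three rules above, the following builds the frame from plain lists and relies on inference alone. The column names follow the question; the exact datetime unit reported (for example ns) may depend on the installed pandas version.

```python
from datetime import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],   # decimal points, so inferred as float
    "y": [np.nan] * 4,           # np.nan rather than None, so inferred as float
    "z": [datetime.now()] * 4,   # datetime objects, so inferred as datetime64
    "e": [1, 2, 3, 4],           # plain ints, so inferred as integer
})

print(df.dtypes)
```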

Here's your example

In [20]: DataFrame({
    'x':Series([1,2,3,4],dtype='float'), 
    'y':Series([None,]*4,dtype='float'), 
    'z':[datetime.datetime.now()] *4, 
    'e':[1,2,3,4]})
Out[20]: 
   e  x   y                          z
0  1  1 NaN 2014-06-17 07:40:42.188422
1  2  2 NaN 2014-06-17 07:40:42.188422
2  3  3 NaN 2014-06-17 07:40:42.188422
3  4  4 NaN 2014-06-17 07:40:42.188422

In [21]: DataFrame({
     'x':Series([1,2,3,4],dtype='float'), 
     'y':Series([None,]*4,dtype='float'), 
     'z':[datetime.datetime.now()] *4, 
     'e':[1,2,3,4]}).dtypes
Out[21]: 
e             int64
x           float64
y           float64
z    datetime64[ns]
dtype: object
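For the records format raised in the question, one possible sketch (an assumption on my part, not something covered above) is to build the frame with DataFrame.from_records and then cast several columns in one call, since DataFrame.astype accepts a column-to-dtype mapping in later pandas versions. This still casts after construction, so it does not avoid the second pass. The column names are taken from the question's dict example.

```python
from datetime import datetime

import pandas as pd

raw = [
    (1, None, datetime.now(), 1),
    (2, None, datetime.now(), 2),
    (3, None, datetime.now(), 3),
    (4, None, datetime.now(), 4),
]

# Column names assumed from the dict version of the example.
columns = ["x", "y", "z", "e"]

df = pd.DataFrame.from_records(raw, columns=columns)

# Cast several columns at once; the None values in "y" become NaN.
df = df.astype({"x": "float64", "y": "float64", "e": "int64"})

print(df.dtypes)
```

The "z" column is left out of the mapping on purpose, so whatever the constructor inferred for the datetimes is kept.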

Upvotes: 3
