Reputation: 21707
For example:
raw = {'x':[1,2,3,4], 'y':[None,]*4, 'z':[datetime.now()] *4, 'e':[1,2,3,4]}
a = pd.DataFrame(raw, dtype={'x':float, 'y':float, 'z':object, 'e':int})
This doesn't work.
Currently I have to do:
a = pd.DataFrame(raw, dtype=object)
a['x'] = a['x'].astype(float)
a['y'] = a['y'].astype(float)
a['z'] = pd.to_datetime(a['z'], utc=True)
a['e'] = a['e'].astype(int)
Since I have a number of raw objects that I would like to cast into DataFrames, is there an easy way to force the right dtypes at construction time, instead of converting them afterwards, which takes twice as long?
@Jeff has a good way to deal with raw if it is in dict format. But what if raw is in records format, like:
raw = [(1,None,datetime.now(),1),
(2,None,datetime.now(),2),
(3,None,datetime.now(),3),
(4,None,datetime.now(),4)]
Do I have to zip it? Perhaps zipping would take more time than casting again afterwards? DataFrame.from_records doesn't seem to accept a dtype parameter at all.
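For the records case, one possible sketch is to let DataFrame.from_records build the frame and then apply a single astype call with a column-to-dtype mapping, rather than casting column by column. The column names below are assumptions (the tuples carry none), and DataFrame.astype only accepts a mapping in pandas versions newer than the one this question was written against.

```python
from datetime import datetime

import pandas as pd

raw = [
    (1, None, datetime.now(), 1),
    (2, None, datetime.now(), 2),
    (3, None, datetime.now(), 3),
    (4, None, datetime.now(), 4),
]

# Assumed column names; the records themselves do not carry any.
columns = ["x", "y", "z", "e"]

# No zip needed: from_records accepts a list of tuples directly.
df = pd.DataFrame.from_records(raw, columns=columns)

# One astype call with a column -> dtype mapping.
# The all-None column "y" starts out as object and becomes NaN-filled float64.
df = df.astype({"x": "float64", "y": "float64", "e": "int64"})

# Timezone-aware datetimes still need an explicit conversion.
df["z"] = pd.to_datetime(df["z"], utc=True)

print(df.dtypes)
```

This still casts after construction, so it does not remove the extra pass; it only collapses the per-column casts into one call.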
Upvotes: 3
Views: 3028
Reputation: 128918
The constructor will infer non-ambiguous types correctly. You cannot specify a compound dtype mapping at the moment; the issue is here, and pull requests to implement this are welcome.
- Don't use None; use np.nan instead (otherwise the column will be inferred as object dtype).
- Force a float column by passing it as a Series, e.g. Series([1,2,3,4],dtype='float').
- Datetimes are inferred as datetime64[ns], which is almost always what you want unless you need to specify a timezone.
Here's your example:
In [20]: DataFrame({
'x':Series([1,2,3,4],dtype='float'),
'y':Series([None,]*4,dtype='float'),
'z':[datetime.datetime.now()] *4,
'e':[1,2,3,4]})
Out[20]:
   e  x   y                          z
0  1  1 NaN 2014-06-17 07:40:42.188422
1  2  2 NaN 2014-06-17 07:40:42.188422
2  3  3 NaN 2014-06-17 07:40:42.188422
3  4  4 NaN 2014-06-17 07:40:42.188422
In [21]: DataFrame({
'x':Series([1,2,3,4],dtype='float'),
'y':Series([None,]*4,dtype='float'),
'z':[datetime.datetime.now()] *4,
'e':[1,2,3,4]}).dtypes
Out[21]:
e             int64
x           float64
y           float64
z    datetime64[ns]
dtype: object
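As a rough sketch of the same idea without wrapping each column in a Series: if the missing values are np.nan rather than None, the constructor infers float64 on its own, and only the columns whose inferred dtype is not the wanted one need an explicit cast. This assumes a pandas version in which DataFrame.astype accepts a column-to-dtype mapping, which is newer than the version used in the session above.

```python
from datetime import datetime

import numpy as np
import pandas as pd

raw = {
    "x": [1, 2, 3, 4],
    "y": [np.nan] * 4,          # np.nan instead of None, so this infers as float64
    "z": [datetime.now()] * 4,  # inferred as a datetime64 column
    "e": [1, 2, 3, 4],          # inferred as int64
}

# Only "x" needs an override; the other columns are inferred as wanted.
df = pd.DataFrame(raw).astype({"x": "float64"})

print(df.dtypes)
```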
Upvotes: 3