Reputation: 5152
Let's consider the following DataFrame:
import pandas as pd
a = pd.DataFrame(data=list(range(10)))
I am trying to create a new column with apply. The column is supposed to contain Timestamp objects:
def test(x):
    x["date"] = pd.Timestamp("2017-01-01")
    return x
a.apply(test, axis=1)
However, this results in a DataFrame containing the numerical .value attribute of the timestamp:
0 date
0 0 1483228800000000000
1 1 1483228800000000000
2 2 1483228800000000000
3 3 1483228800000000000
4 4 1483228800000000000
5 5 1483228800000000000
6 6 1483228800000000000
7 7 1483228800000000000
8 8 1483228800000000000
9 9 1483228800000000000
Why does this happen? How can I get the correct Timestamp in the DataFrame?
Edit: here is the full code that is giving me problems. It is meant to show respondents that I am not simply trying to assign a plain list of datetimes to a new column:
def mae(x):
    entry = x.name[0]
    exit = x.name[1]
    m = d.loc[entry:exit, "close"]
    if x.dir == -1:
        r = (m.sub(m[::-1].cummax()[::-1])).abs().max()
        imax = (m.sub(m[::-1].cummax()[::-1])).idxmin()
    elif x.dir == 1:
        r = (m.sub(m[::-1].cummin()[::-1])).abs().max()
        imax = (m.sub(m[::-1].cummin()[::-1])).idxmax()
    else:
        r = 0
    x['mae'] = r * gbp['multiplier']
    x['peak'] = imax
    return x
k = g.head().apply(mae, axis=1)
This takes as input another DataFrame and some financial price data stored in a database, so it is a bit difficult to invent an example that is simple enough for people to grasp quickly and yet complicated enough to justify the use of .apply.
I think there is either something I don't understand or a bug in the .apply function. This is what I would like input and help on. Thanks guys!
Upvotes: 0
Views: 2648
Reputation: 1258
After a few rounds of debugging your code and stepping further into pandas, it seems there is something you can improve in your code.
You can read more details in _setitem_with_indexer (lines 387-393) and in numpy.concatenate.
In short, _setitem_with_indexer makes use of numpy.concatenate as part of setting new values through an indexer. Because the first column in a held only integers, setting the new column a['date'] meant concatenating an integer with a pandas.Timestamp (which is actually a numpy.datetime64 at that point), and numpy simply refused to do that.
To show that this is the case, consider the following examples.
Setup
import pandas as pd
import numpy as np
s1 = [1]
s2 = np.array([np.datetime64("2017-01-01")])
s3 = [pd.Timestamp("2017-01-01")]
a = pd.DataFrame(data=pd.date_range("01-01-2017", "01-06-2017"))
b = pd.DataFrame(data=["d", "c", "d", "d"])
c = pd.DataFrame(data=list(range(10)))
def test(x):
    x["date"] = pd.Timestamp("2017-01-01")
    return x
Trials
a.apply(test, axis=1)
# output
0 date
0 2017-01-01 2017-01-01
1 2017-01-02 2017-01-01
2 2017-01-03 2017-01-01
3 2017-01-04 2017-01-01
4 2017-01-05 2017-01-01
5 2017-01-06 2017-01-01
b.apply(test, axis=1)
# output
0 date
0 d 1483228800000000000
1 c 1483228800000000000
2 d 1483228800000000000
3 d 1483228800000000000
c.apply(test,axis=1)
# output
0 date
0 0 1483228800000000000
1 1 1483228800000000000
2 2 1483228800000000000
3 3 1483228800000000000
4 4 1483228800000000000
5 5 1483228800000000000
6 6 1483228800000000000
7 7 1483228800000000000
8 8 1483228800000000000
9 9 1483228800000000000
I think this is mostly the behaviour of numpy.concatenate. If we use pd.Timestamp, we observe different behaviour: the concatenation succeeds. That is not the case with numpy.datetime64, which is what is used inside _setitem_with_indexer.
Observation
np.concatenate([s1,s2])
# output
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: invalid type promotion
np.concatenate([s1,s3])
# output
array([1, Timestamp('2017-01-01 00:00:00')], dtype=object)
Please note that the dtype here is object.
As for cleanness or elegance, I think Wen's comment is already very minimal, but you might have a reason for using apply.
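If the new column does not actually need row-wise logic, a minimal sketch of skipping apply altogether is plain assignment. The frame mirrors the one in the question; the column name n is an assumption made here for readability and is not part of the original code.

```python
import pandas as pd

# Same shape as the question's frame; the column name "n" is illustrative.
df = pd.DataFrame({"n": list(range(10))})

# Assigning a scalar Timestamp broadcasts it to every row and
# stores the column with a datetime64 dtype rather than integers.
df["date"] = pd.Timestamp("2017-01-01")

print(df.dtypes)
```

This only covers the constant-value case; it does not replace a function like mae that computes a different value per row.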
One more thing to point out: apply takes a function that operates on each column or row, but it seems you were treating the function argument as a DataFrame.
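When the value really does depend on each row, one possible pattern is to have the function return only the new value and assign the result of apply to the column, instead of adding a key to the row inside the function. The sketch below is an assumption about how that could look; row_date and the column name n are hypothetical names, not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({"n": list(range(5))})

def row_date(row):
    # `row` is a Series holding a single row; return only the new value.
    return pd.Timestamp("2017-01-01") + pd.Timedelta(days=int(row["n"]))

# apply(axis=1) collects the returned scalars into a Series aligned on the
# frame's index; pd.to_datetime makes the datetime dtype explicit.
df["date"] = pd.to_datetime(df.apply(row_date, axis=1))

print(df)
```

Because the row is never enlarged inside the function, the existing columns' dtypes do not come into play when the timestamps are stored.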
Upvotes: 3
Reputation: 7294
I think you just need to tell pandas that the column should be interpreted as a datetime:
a = pd.DataFrame(data=list(range(10)))
def test(x):
    x['date'] = pd.Timestamp('2017-01-01')
    return x
a = a.apply(test, axis=1)
a.date = a.date.astype('datetime64[ns]')
This gives:
0 date
0 0 2017-01-01
1 1 2017-01-01
2 2 2017-01-01
3 3 2017-01-01
4 4 2017-01-01
5 5 2017-01-01
6 6 2017-01-01
7 7 2017-01-01
8 8 2017-01-01
9 9 2017-01-01
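The conversion works because the integers in the question are the timestamp's nanoseconds since the Unix epoch (its .value attribute). As a small sketch of the same idea using pd.to_datetime with an explicit unit, with variable names chosen here only for illustration:

```python
import pandas as pd

# The integer from the question's output, repeated for a few rows.
ns = pd.Series([1483228800000000000] * 3)

# Interpret the integers as nanoseconds since 1970-01-01.
dates = pd.to_datetime(ns, unit="ns")

# Round trip: the Timestamp's .value is the same nanosecond count.
print(dates.iloc[0], pd.Timestamp("2017-01-01").value)
```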
Alternatively, I was able to do this by creating the column first and setting its type:
a = pd.DataFrame(data=list(range(10)))
a['date'] = None
# astype returns a new Series, so assign the result back to the column
a['date'] = a['date'].astype('datetime64[ns]')
def test(x):
    x['date'] = pd.Timestamp('2017-01-01')
    return x
a = a.apply(test, axis=1)
Upvotes: 0