Reputation: 5498
I'm trying to create a dataframe column using np.where or similar when certain criteria matches as follows:
This linked pickle file is a segment of the data I'm using. Here is the code I've used so far:
data = pd.read_pickle('data_df')
Data:
Open High Low Last Volume
Timestamp
2014-03-04 09:30:00 1783.50 1784.50 1783.50 1784.50 171
2014-03-04 09:31:00 1784.75 1785.75 1784.50 1785.25 28
2014-03-04 09:32:00 1785.00 1786.50 1785.00 1786.50 81
2014-03-04 09:33:00 1786.00 1786.00 1785.25 1785.25 41
2014-03-04 09:34:00 1785.00 1785.25 1784.75 1785.25 11
These are the time variables which I need to use in the np.where element:
#Times
daystart = '9:30'
dayend = '16:14:59'
IB_end = '10:29:59'
IB_session = data.between_time(daystart,IB_end, include_start=True, include_end=True)
day_session = data.between_time(daystart,dayend, include_start=True, include_end=True)
These are the high and low of these blocks of time:
IB_high = IB_session['High'].rolling(window=1,freq='D').max()
IB_low = IB_session['Low'].rolling(window=1,freq='D').min()
I need to write an np.where statement or similar to create new columns showing:
When data['Last'] > IB_high
or when data['Last'] < IB_low
and my statement needs to ensure that only one of these can happen per day. A simple 1 in the column 'IB_High_Break' or 'IB_Low_Break' is needed to mark this happening. This will eventually be used as a signal to enter a trade.
I tried using this:
data['IB_High_Break'] = np.where(data['Last'] > IB_high,1, np.nan);
data['IB_Low_Break'] = np.where(data['Last'] < IB_low,1,np.nan);
But get an error ValueError: Series lengths must match to compare
.
What is a good way to do this?
Upvotes: 2
Views: 550
Reputation: 1002
If You have problem with lengths and could put data to same array, do it like here:...
>>> a = pd.Series(np.arange(10))
>>> b = pd.Series(np.arange(100,111))
>>> df = pd.DataFrame([a,b])
>>> df = df.T
>>> df.loc[4.5]=nan
>>> df.sort_index(inplace=True)
>>> df.interpolate(inplace=True)
>>> df
0 1
0.0 0.0 100.0
1.0 1.0 101.0
2.0 2.0 102.0
3.0 3.0 103.0
4.0 4.0 104.0
4.5 4.5 104.5
5.0 5.0 105.0
6.0 6.0 106.0
7.0 7.0 107.0
8.0 8.0 108.0
9.0 9.0 109.0
10.0 9.0 110.0
>>> df.loc[4.5,1]=0
>>> np.where(df[0]<df[1],1,np.nan)
array([ 1., 1., 1., 1., 1., nan, 1., 1., 1., 1., 1.,
1.])
>>> np.where(df[0]>df[1],1,np.nan)
array([ nan, nan, nan, nan, nan, 1., nan, nan, nan, nan, nan,
nan])
>>>
Otherwise You need to do some iterations...
Upvotes: 1
Reputation: 2182
I think I heard once that pickled files shouldn't be shared amongst unknown/untrusted entities as it's not build to be a secure deserialization, so this is answer is just based on skimming your code.
I think you get pandas.Series of inequal lengths, because your boolean creates a new pandas Series of length equal to rows that met your boolean, and therefore it is impossible for both data['Last'] > IB_high
and data['Last'] < IB_high
to have the same length as the row count in your data dataframe.
I would do something like this instead
data['IB_Low_Break'] = data['Last'].map(lambda i: i < IB_low);
Upvotes: 0