Reputation: 1471
I have a DataFrame with a datetime index. First, here is my fake data.
import pandas as pd
data1 = {'date': ['20190219 093100', '20190219 103200', '20190219 171200',
                  '20190219 193900', '20190219 194500', '20190220 093500',
                  '20190220 093600'],
         'number': [18.6125, 12.85, 14.89, 15.8301, 15.85, 14.916, 14.95]}
df1 = pd.DataFrame(data1)
df1 = df1.set_index('date')
df1.index = pd.to_datetime(df1.index).strftime('%Y-%m-%d %H:%M:%S')
I want to create a new categorical column named "New_column" that holds 'Yes' or 'No', depending on whether the value in the "number" column increases by at least 20 percent later in the same day.
In this fake data, only the second value, "12.85", should be "Yes", because it has increased by 23.35 percent at the timestamp "2019-02-19 19:45:00".
The first value is 25% greater than the third value, but the higher value comes before the lower one rather than after it, so it should not be counted.
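To make the expected result concrete, the two percentages quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper name pct_increase is hypothetical and not part of pandas.

```python
def pct_increase(earlier, later):
    """Percentage change going from an earlier value to a later one."""
    return (later - earlier) / earlier * 100


# Second value (10:32) compared with the last value of the same day (19:45)
second_to_last = pct_increase(12.85, 15.85)

# Third value (17:12) compared with the first value (09:31); the first value
# is earlier in time, so this pair must not count as an increase
third_to_first = pct_increase(14.89, 18.6125)

print(round(second_to_last, 2))
print(round(third_to_first, 2))
```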
After the process, I should have NaN in the "New_column" for the last row of each day.
I have been trying many different ways to do it using:
How can I do this in a Pythonic way?
Upvotes: 0
Views: 738
Reputation: 516
Initial setup
import numpy as np
import pandas as pd

data = {
    'datetime': ['20190219 093100', '20190219 103200', '20190219 171200',
                 '20190219 193900', '20190219 194500', '20190220 093500',
                 '20190220 093600'],
    'number': [18.6125, 12.85, 14.89, 15.8301, 15.85, 14.916, 14.95]
}
df = pd.DataFrame(data)
# Parse the strings with an explicit format; astype('datetime64') without a
# unit is rejected by recent pandas versions
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y%m%d %H%M%S')
df = df.sort_values('datetime')
df['date'] = df['datetime'].dt.date
df['New_column'] = 'No'
Find all rows that see a 20% increase later in the same day
indices_true = set()
# Compare every pair of rows (quadratic, but fine for a small frame)
for idx_low, row_low in df.iterrows():
    for idx_high, row_high in df.iterrows():
        # "at least 20 percent" includes exactly 20%, hence <=
        if (row_low['date'] == row_high['date'] and
                row_low['datetime'] < row_high['datetime'] and
                row_low['number'] * 1.2 <= row_high['number']):
            indices_true.add(idx_low)
# Assign 'Yes' for the true rows
for i in indices_true:
    df.loc[i, 'New_column'] = 'Yes'
# Last timestamp every day assigned as NaN
df.loc[df['date'] != df['date'].shift(-1), 'New_column'] = np.nan
# Optionally convert to categorical variable
df['New_column'] = pd.Categorical(df['New_column'])
Output
>>> df
             datetime   number        date New_column
0 2019-02-19 09:31:00  18.6125  2019-02-19         No
1 2019-02-19 10:32:00  12.8500  2019-02-19        Yes
2 2019-02-19 17:12:00  14.8900  2019-02-19         No
3 2019-02-19 19:39:00  15.8301  2019-02-19         No
4 2019-02-19 19:45:00  15.8500  2019-02-19        NaN
5 2019-02-20 09:35:00  14.9160  2019-02-20         No
6 2019-02-20 09:36:00  14.9500  2019-02-20        NaN
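The nested iterrows loop compares every pair of rows, which may become slow on a large frame. One possible alternative, offered only as a sketch and not as the method used above, is to compute for each row the maximum of the values that come later on the same day (a reversed per-day cummax shifted by one row) and compare it with 1.2 times the current value. The names rev, later_max and hit are introduced here for illustration, and the sketch assumes the same sample data as the setup.

```python
import numpy as np
import pandas as pd

data = {
    'datetime': ['20190219 093100', '20190219 103200', '20190219 171200',
                 '20190219 193900', '20190219 194500', '20190220 093500',
                 '20190220 093600'],
    'number': [18.6125, 12.85, 14.89, 15.8301, 15.85, 14.916, 14.95],
}
df = pd.DataFrame(data)
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y%m%d %H%M%S')
df = df.sort_values('datetime').reset_index(drop=True)
df['date'] = df['datetime'].dt.date

# Walk each day backwards: the running maximum, shifted by one row, is the
# largest value strictly after the current row on that day. The last row of
# each day has nothing after it, so it gets NaN.
rev = df.iloc[::-1]
later_max = (
    rev.groupby('date')['number']
       .transform(lambda s: s.cummax().shift())
       .reindex(df.index)  # restore chronological order before comparing
)

# True where some later value on the same day is at least 20% higher
hit = later_max >= df['number'] * 1.2

labels = pd.Series(np.where(hit, 'Yes', 'No'), index=df.index)
# Keep the label only where a later row exists; elsewhere leave NaN
df['New_column'] = pd.Categorical(labels.where(later_max.notna()))

print(df)
```

On the sample data this should give the same New_column as the loop version, while touching each row only once per day instead of once per pair.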
Upvotes: 2