Reputation: 173
I am trying to modify my pandas dataframe, but still I haven't gotten any results. I have a dataframe of this form:
fc.iloc[:,:2][:10]
Out[309]:
DATE TOTAL_REVENUE_TTM
TIC
AMZN.O 2018-01-01 3.769385e+11
AMZN.O 2018-01-01 3.775364e+11
AMZN.O 2018-01-01 3.772219e+11
AMZN.O 2018-01-01 3.766596e+11
AMZN.O 2018-01-01 3.757391e+11
AMZN.O 2018-02-01 3.931331e+11
AMZN.O 2018-02-01 3.935390e+11
AMZN.O 2018-02-01 3.932100e+11
AMZN.O 2018-02-01 3.926973e+11
AMZN.O 2018-02-01 3.917282e+11
I need to properly modify the column DATE, which does not reflect the real date. The hard thing is that I need to modify it following a sort of pattern. Here I printed only the 10 columns, but I have many observation, that follow the same rule.
To summarize, these are the outputs of size 5 of an LSTM network (I printed just two outputs) and the printed date is only the last date of the output. I need to recover all the other dates, knowing that they are all shifted back in time by 12 months. Practically speaking I should offset the date in 4th row by 1 year, the date in 3rd row by 2 year and so on. This should be applied for every output (say, every 5 row of the dataframe), and for every company in the dataframe.
The way the code works should be also generalized, if for example I would change the size of the output, so that I could have 3 dates repeating instead of 5, and also if I change the offsetting back in time.
I was trying to write a function that takes those 2 parameters and returns the entire dataframe with the dates properly shifted, but I am having a hard time find the way to reproduce the pattern correctly.
Here there is a piece of code for my function that should take every piece of dataframe referring to a company, elaborate it and then concatenate again to form the new dataframe:
def modify_forecasts_dates(fc,offset,output_size):
#extract the company names
companies = fc.index.get_level_values(0).unique()
cmp_to_concat = []
for tick in companies:
cmp = fc.loc[tick]
cmp.insert(0, 'ROW_NUM', range(1,len(cmp)+1))
cmp['BOOL'] = np.where((np.mod(cmp['ROW_NUM'],output_size)==0), 'OK', 'MODIFY')
#cmp.drop(labels=['ROW_NUM','BOOL'],inplace=True)
cmp_to_concat.append(cmp)
final_df = pd.concat(cmp_to_concat)
return final_df
I firstly tried to insert a count row number, to identify which are the row that do not need to be changed, namely every last row of a piece of output. The problem still remains because then I don't know how to update all observation before the last output row.
Is there an easy way to take the dataframe by a step of 5 row (or equivalently by a step of output size) and properly offset all the preceding dates, obtaining t, t-offset, t-2*offset and so on?
I know it is a bit complicated to udnerstand what I need to do, but I am really struggling. I could do it by hand, but it's preferable to automatize the process.
Upvotes: 0
Views: 308
Reputation: 150745
I assume your data will come in group of num_year
rows (in this case num_year=5
). Also, I found it unsafe to have non-unique index, which make updating a column more difficult. So I choose to reset index:
# sample data
df = pd.DataFrame({'tic':['a']*10+['b']*20,
'date': sorted(list(pd.date_range('2018-01-01', '2018-01-06', freq='d')
)* 5 )}).set_index('tic')
# make sure companies are grouped together,
# skip if they are already
df = df.reset_index().sort_values(['tic','date'])
num_year = 5
# offset pattern
delta = pd.Series([pd.DateOffset(years=i)
for i in range(num_year-1,-1,-1)] * (len(df)//num_year))
# update date
df.date = df.date - delta
df.head(10)
Output:
+----+------+------------+
| | tic | date |
+----+------+------------+
| 0 | a | 2014-01-01 |
| 1 | a | 2015-01-01 |
| 2 | a | 2016-01-01 |
| 3 | a | 2017-01-01 |
| 4 | a | 2018-01-01 |
| 5 | a | 2014-01-02 |
| 6 | a | 2015-01-02 |
| 7 | a | 2016-01-02 |
| 8 | a | 2017-01-02 |
| 9 | a | 2018-01-02 |
+----+------+------------+
Upvotes: 1