MJS
MJS

Reputation: 1623

python pandas - how can I map values in 1 dataframe to indices in another without looping?

I have 2 dataframes - "df_rollmax" is a derivative of "df_data" with the same shape. I am attempting to map the values of df_rollmax back to df_data and create a third df (df_maxdates) which contains the dates at which each value in df_rollmax originally showed up in df_data.

list1 = [[21,101],[22,110],[25,113],[24,112],[21,109],[28,108],[30,102],[26,106],[25,111],[24,110]]
df_data = pd.DataFrame(list1,index=pd.date_range('2000-1-1',periods=10, freq='D'), columns=list('AB'))
df_rollmax = pd.DataFrame(df_data.rolling(center=False,window=5).max())
mapA = pd.Series(df_data.index, index=df_data['A'])

From a previous question, I see that a single date can be found with:

mapA[rollmax.ix['j','A']] returns Timestamp('2000-01-07 00:00:00')

But my real dataset is much larger and I would like to fill the third dataframe with dates without looping over every row and column.

Mapping back to the indices is a problem due to: ValueError: cannot reindex from a duplicate axis so this isn't working...

df_maxdates = pd.DataFrame(index=df_data.index, columns=df_data.columns)
for s in df_data.columns:
    df_maxdates[s] = mapA.loc[df_rollmax[s]]

Using the last instance of the duplicate value would be fine, but df.duplicated(keep='last') isn't cooperating.

Greatly appreciate any and all wisdom.

Link to original question

Update - this is what df_maxdates would look like:

enter image description here

Upvotes: 1

Views: 1350

Answers (1)

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210882

You can use this BrenBarn's solution:

W = 5  # window size

df = pd.DataFrame(columns=df_data.columns, index=df_data.index[W-1:])

for col in df.columns.tolist():
    df[col] = df_data.index[df_data[col].rolling(W)
                                    .apply(np.argmax)[(W-1):]
                                    .astype(int)
                             +
                             np.arange(len(df_data)-(W-1))]

df = pd.DataFrame(columns=df_data.columns, index=df_data.index[:W-1]).append(df)

In [226]: df
Out[226]:
                    A          B
2000-01-01        NaT        NaT
2000-01-02        NaT        NaT
2000-01-03        NaT        NaT
2000-01-04        NaT        NaT
2000-01-05 2000-01-03 2000-01-03
2000-01-06 2000-01-06 2000-01-03
2000-01-07 2000-01-07 2000-01-03
2000-01-08 2000-01-07 2000-01-04
2000-01-09 2000-01-07 2000-01-09
2000-01-10 2000-01-07 2000-01-09

or this piRSquared's solution:

def idxmax(s, w):
    i = 0
    while i + w <= len(s):
        yield(s.iloc[i:i+w].idxmax())
        i += 1

x = pd.DataFrame({'A':[np.nan]*4 + list(idxmax(df_data.A, 5)),
                  'B':[np.nan]*4 + list(idxmax(df_data.B, 5))},
                 index=df_data.index)

Demo:

In [89]: x = pd.DataFrame({'A':pd.to_datetime([np.nan]*4 + list(idxmax(df_data.A, 5))),
    ...:                   'B':pd.to_datetime([np.nan]*4 + list(idxmax(df_data.B, 5)))},
    ...:                  index=df_data.index)
    ...:

In [90]: x
Out[90]:
                    A          B
2000-01-01        NaT        NaT
2000-01-02        NaT        NaT
2000-01-03        NaT        NaT
2000-01-04        NaT        NaT
2000-01-05 2000-01-03 2000-01-03
2000-01-06 2000-01-06 2000-01-03
2000-01-07 2000-01-07 2000-01-03
2000-01-08 2000-01-07 2000-01-04
2000-01-09 2000-01-07 2000-01-09
2000-01-10 2000-01-07 2000-01-09

Upvotes: 1

Related Questions