Reputation: 1723
I'm looking for the best way of solving the following problem in Pandas. Given a dataframe df, e.g.
import pandas as pd
dt = pd.date_range('1/1/2014', periods=10, freq='H')
df = pd.DataFrame({'A':[0.,0.,0.,1.,0.,0.,1.,0.,0.,0.],'B':[0.,0.,0.,0.,1.,0.,0.,0.,0.,1.]},index=dt)
df
A B
2014-01-01 00:00:00 0 0
2014-01-01 01:00:00 0 0
2014-01-01 02:00:00 0 0
2014-01-01 03:00:00 1 0
2014-01-01 04:00:00 0 1
2014-01-01 05:00:00 0 0
2014-01-01 06:00:00 1 0
2014-01-01 07:00:00 0 0
2014-01-01 08:00:00 0 0
2014-01-01 09:00:00 0 1
then for each row where B=1, I would like to find the index of the last row before it where A=1, i.e.
A B C
2014-01-01 04:00:00 0 1 2014-01-01 03:00:00
2014-01-01 09:00:00 0 1 2014-01-01 06:00:00
I'm not sure what the easiest way is to do this in Pandas while also being reasonably efficient for large dataframes.
Update:
I should have made clear in my original question that I'd like this to work when there is potentially more than one A=1 per B=1.
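For reference, one vectorized sketch of the idea (my own rough attempt, assuming it's acceptable to forward-fill the last A == 1 timestamp and then keep only the B == 1 rows):

```python
import pandas as pd

# Rebuild the example frame from the question.
dt = pd.date_range('1/1/2014', periods=10, freq='H')
df = pd.DataFrame({'A': [0., 0., 0., 1., 0., 0., 1., 0., 0., 0.],
                   'B': [0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]},
                  index=dt)

# Index value wherever A == 1, NaT elsewhere, then forward-filled:
# each row now carries the timestamp of the last A == 1 at or before it.
last_a = pd.Series(df.index.where(df['A'] == 1), index=df.index).ffill()

# Keep only the B == 1 rows and attach the forward-filled timestamps.
out = df.loc[df['B'] == 1].copy()
out['C'] = last_a
```

This stays vectorized, so it should also behave reasonably on large frames.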
Upvotes: 0
Views: 472
Reputation: 1396
This is what I would do; it's easy to understand and readable:
import numpy as np
import pandas as pd
dt = pd.date_range('1/1/2014', periods=10, freq='H')
df = pd.DataFrame({'A':[0.,0.,0.,1.,0.,0.,1.,0.,0.,0.],
'B':[0.,0.,0.,0.,1.,0.,0.,0.,0.,1.]},
index=dt)
# How I would do it
temp = np.where(df['B']==1)[0]
newdf = df.iloc[temp].copy()  # .copy() avoids a SettingWithCopyWarning below
temp2 = np.where(df['A']==1)[0]
newdf['C'] = df.iloc[temp2].index
which prints
In [8]: newdf
Out[8]:
A B C
2014-01-01 04:00:00 0 1 2014-01-01 03:00:00
2014-01-01 09:00:00 0 1 2014-01-01 06:00:00
Upvotes: 0
Reputation: 16488
Notice that the answer from @jeanrjc is not stable if you have more than one A == 1 row per B == 1 group. To show this, I modify the given dataframe:
In[217]: df.loc[df.index[2], 'A'] = 1
In[218]: df
Out[218]:
A B
2014-01-01 00:00:00 0 0
2014-01-01 01:00:00 0 0
2014-01-01 02:00:00 1 0
2014-01-01 03:00:00 1 0
2014-01-01 04:00:00 0 1
2014-01-01 05:00:00 0 0
2014-01-01 06:00:00 1 0
2014-01-01 07:00:00 0 0
2014-01-01 08:00:00 0 0
2014-01-01 09:00:00 0 1
That answer takes the timestamps from wherever df.A == 1; if the number of those rows no longer matches the number of B == 1 rows, you get an error:
In[219]: new_df["C"] = df.index[np.where(df.A==1)[0]]
Traceback (...)
ValueError: Length of values does not match length of index
Instead, I build a list that collects the correct index for each row. If you want to understand what happens, read it from the outside in: for each b where B == 1, df.loc[:b] is everything up to and including b, the inner selection keeps the rows where A == 1, and .iloc[-1].name is the index of the last of those.
In[239]: new_df['C'] = [df.loc[:b][df.loc[:b].A == 1].iloc[-1].name for b in df.loc[df.B == 1].index ]
In[240]: new_df
Out[240]:
A B C
2014-01-01 04:00:00 0 1 2014-01-01 03:00:00
2014-01-01 09:00:00 0 1 2014-01-01 06:00:00
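For larger frames, the same result on this modified dataframe can be sketched with pd.merge_asof, a backward as-of join that picks, for each B == 1 row, the last A == 1 timestamp at or before it (variable names here are my own):

```python
import pandas as pd

# Rebuild the modified frame from above (A == 1 at 02:00 as well).
dt = pd.date_range('1/1/2014', periods=10, freq='H')
df = pd.DataFrame({'A': [0., 0., 1., 1., 0., 0., 1., 0., 0., 0.],
                   'B': [0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]},
                  index=dt)

# Left side: the B == 1 rows; right side: the A == 1 timestamps.
b_rows = df.loc[df['B'] == 1].reset_index()
a_times = pd.DataFrame({'index': df.index[df['A'] == 1],
                        'C': df.index[df['A'] == 1]})

# Backward as-of join: each B row gets the last earlier A timestamp,
# regardless of how many A == 1 rows fall in each group.
new_df = pd.merge_asof(b_rows, a_times, on='index').set_index('index')
```

Both keys are already sorted (merge_asof requires that), so this avoids the per-row slicing of the list comprehension above.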
Upvotes: 1
Reputation: 21873
I would do it like this:
import numpy as np

new_df = df[df.B==1].copy()  # .copy() avoids a SettingWithCopyWarning
new_df["C"] = df.index[np.where(df.A==1)[0]]
new_df
A B C
2014-01-01 04:00:00 0 1 2014-01-01 03:00:00
2014-01-01 09:00:00 0 1 2014-01-01 06:00:00
Upvotes: 0
Reputation: 52256
Here's one idea; there may be something better.
Populate a column C
with index values where A
is 1.
df['C'] = np.where(df['A'] == 1, df.index, pd.NaT)
Forward fill the missing values, so each row holds the last time A
was 1.
df['C'] = df['C'].ffill()
Finally, set the values where B
is not 1 back to missing.
df['C'] = np.where(df['B'] == 1, df['C'], pd.NaT)
In [108]: df
Out[108]:
A B C
2014-01-01 00:00:00 0 0 NaT
2014-01-01 01:00:00 0 0 NaT
2014-01-01 02:00:00 0 0 NaT
2014-01-01 03:00:00 1 0 NaT
2014-01-01 04:00:00 0 1 2014-01-01 03:00:00
2014-01-01 05:00:00 0 0 NaT
2014-01-01 06:00:00 1 0 NaT
2014-01-01 07:00:00 0 0 NaT
2014-01-01 08:00:00 0 0 NaT
2014-01-01 09:00:00 0 1 2014-01-01 06:00:00
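Put together as one runnable sketch (a restatement of the three steps above, not new logic; I use df.index.to_series().where(...) and Series.where so the column keeps a datetime dtype):

```python
import pandas as pd

dt = pd.date_range('1/1/2014', periods=10, freq='H')
df = pd.DataFrame({'A': [0., 0., 0., 1., 0., 0., 1., 0., 0., 0.],
                   'B': [0., 0., 0., 0., 1., 0., 0., 0., 0., 1.]},
                  index=dt)

# Step 1: index values where A is 1, NaT elsewhere.
df['C'] = df.index.to_series().where(df['A'] == 1)
# Step 2: forward fill, so each row holds the last time A was 1.
df['C'] = df['C'].ffill()
# Step 3: set rows where B is not 1 back to NaT.
df['C'] = df['C'].where(df['B'] == 1)
```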
Upvotes: 1