irene
irene

Reputation: 2243

get the index of previous occurrence of a value in a pandas Series

Let's say I have a dataframe

>> df = pd.DataFrame({'code': ['A', 'A', 'B', 'B', 'C', 'C', 'C', 'B', 'A', 'C', 'C', 'B' ]})
    code
0   A
1   A
2   B
3   B
4   C
5   C
6   C
7   B
8   A
9   C
10  C
11  B

I want to create another column which gives the items in code before the next consecutive occurrence of the same value, as well as the target index of the previous occurrence. For example:

    code target prev_idx
0   A [] NaN
1   A [] 0
2   B [] NaN
3   B [] 2
4   C [] NaN
5   C [] 4
6   C [] 5
7   B [C, C, C] 3
8   A [B, B, C, C, C, B] 1
9   C [B, A] 6
10  C [] 9
11  B [A, C, C] 7

What's the fastest way to go about this?

Upvotes: 0

Views: 674

Answers (2)

tozCSS
tozCSS

Reputation: 6114

def get_target(row,df=None):
    """
    this function is applied to all rows in df
    row.name is the index of the current row
    row.code is the value in the code column
    did row.code value appear ever before the current row?
    if so return the segments starting from its last appearance till current row
    """
    target = (df.code[:row.name]==row.code)
    if target.any():
        prev_idx = target[target].index[-1]
        return {'target':df.code[prev_idx:row.name].tolist()[1:],
                'prev_idx':prev_idx}
    elif target.empty:  # if this is the first row
        return {'target':[],'prev_idx':-1}
    # return the index of the previous row
    return {'target':[],'prev_idx':target.shift().index[-1]}

df[['target','prev_idx']] = df.apply(get_target,df=df,axis=1,result_type='expand')
df

to get

code target prev_idx
0 A [] -1
1 A [] 0
2 B [] 1
3 B [] 2
4 C [] 3
5 C [] 4
6 C [] 5
7 B ['C', 'C', 'C'] 3
8 A ['B', 'B', 'C', 'C', 'C', 'B'] 1
9 C ['B', 'A'] 6
10 C [] 9
11 B ['A', 'C', 'C'] 7

Upvotes: 1

irene
irene

Reputation: 2243

Found another way. It's faster than the answer posted by @tozCSS for my actual application (thousands of rows), probably because we don't search the entire dataframe every time (target = (df.code[:row.name]==row.code)). As a bonus, it also works even if the indexing is not from 0 to len(df).

def get_target(row, data, col):
    if not row['first_of_seq']:
        return []
    else:
        return data.iloc[data.index.get_loc(row['prev_idx'] + 1): data.index.get_loc(row.name)][col].tolist()

def get_elements_in_between(data, col):
    data = data.copy()

    # get the index of the previous occurrence of the item
    old_idx_name = data.index.name
    data = data.rename_axis('dummy_idx').reset_index()
    data['prev_idx'] = data.groupby(col)['dummy_idx'].shift()

    # check if the item is the first of consecutive occurrences or not
    data['first_of_seq'] = ((data['prev_idx'] - data['dummy_idx']) != -1) & (~data['prev_idx'].isna())

    # convert floats to integers (and NaN to <NA>)
    data['prev_idx'] = data['prev_idx'].astype('Int64')

    # revert to the original indexing
    data = data.set_index('dummy_idx')
    data = data.rename_axis(old_idx_name)

    # get the targets using the target indices
    data['target'] = data.apply(get_target, args=(data, col), axis=1)
    
    data.drop(columns=['first_of_seq'], inplace=True)
    
    return data

We get:

>> get_elements_in_between(df, 'code')
    code    prev_idx    target
0   A   <NA>    []
1   A   0   []
2   B   <NA>    []
3   B   2   []
4   C   <NA>    []
5   C   4   []
6   C   5   []
7   B   3   [C, C, C]
8   A   1   [B, B, C, C, C, B]
9   C   6   [B, A]
10  C   9   []
11  B   7   [A, C, C]

Upvotes: 0

Related Questions