Arnaud Renaud

Reputation: 777

Pandas: groupby and get index of first row matching condition

I have a pandas DataFrame called df, sorted in chronological order. Each row is a visit on a website.

df has a column named display that indicates the number of times a specific page has been displayed during the visit. This column is populated by integers, 0 or greater. df also has a user column.

I want to know how many times each user visited the site before ever seeing the business-critical page I'm interested in.

To know that, I need a user-indexed Series populated as follows:

Upvotes: 0

Views: 4966

Answers (2)

Andy Hayden

Reputation: 375415

I think it's easier to use plain ol' argmax:

In [11]: df = pd.DataFrame([[1, 0], [1, 0], [1, 1], [2, 0], [2, 1]], columns=['user', 'display'])

In [12]: df
Out[12]:
   user  display
0     1        0
1     1        0
2     1        1
3     2        0
4     2        1

In [13]: df.groupby('user')['display'].apply(lambda x: np.argmax(x.values))
Out[13]:
user
1       2
2       1
Name: display, dtype: int64

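Although, for the sake of clarity (or if display isn't just 0/1, since argmax would then return the position of the largest count rather than the first nonzero one), I would define a new column: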

In [21]: df['seen'] = df['display'] > 0

In [22]: df.groupby('user')['seen'].apply(lambda x: np.argmax(x.values))
Out[22]:
user
1       2
2       1
Name: seen, dtype: int64

Note: my old answer used df.groupby('user')['display'].apply(np.argmax), which wasn't quite correct: it gave the index label of the first True row rather than its position within the group.
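One caveat with the argmax approach: np.argmax returns 0 for an all-False array, so a user who never saw the page looks the same as a user who saw it on the first visit. Here is a minimal sketch that guards against this, assuming you want -1 for such users. The helper name first_seen_position, the -1 sentinel, and the extra user 3 in the toy data are my own choices for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): user 3 never sees the page.
df = pd.DataFrame(
    [[1, 0], [1, 0], [1, 1], [2, 0], [2, 1], [3, 0], [3, 0]],
    columns=['user', 'display'],
)

def first_seen_position(x):
    """Position of the first visit with display > 0, or -1 if there is none."""
    seen = x.values > 0
    # argmax returns 0 for an all-False array, so check any() first.
    return int(np.argmax(seen)) if seen.any() else -1

result = df.groupby('user')['display'].apply(first_seen_position)
print(result)
```

Any sentinel works as long as it can't be confused with a real position; NaN is another option if you don't mind a float result.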

Upvotes: 2

Arnaud Renaud

Reputation: 777

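import numpy as np

def nvisits_before_display(x):
    try:
        # Position of the first visit where the page was displayed, counted from 1
        return np.where(x > 0)[0].item(0) + 1
    except IndexError:
        # The user never saw the page
        return 0

df.groupby('user').display.apply(nvisits_before_display)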

What does this mean?

  • x > 0, when applied to the column display, means that the page has been displayed on a given visit
  • np.where(<condition>)[0] returns a numpy.ndarray containing the integer positions (in increasing order) of the rows where the condition is met
  • item(0) takes the first of these positions, meaning the first visit where the page was displayed; it raises IndexError if the page was never displayed, in which case the function returns 0
  • + 1 shifts the result so that users who saw the page on their first visit get the value 1
  • groupby('user').display.apply(...) applies the nvisits_before_display function to the display values of each user's rows
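To make the steps above concrete, here is a small self-contained run. The toy DataFrame and the variable name counts are made up for illustration; the function is the same as above.

```python
import numpy as np
import pandas as pd

def nvisits_before_display(x):
    try:
        # 1-based position of the first visit with display > 0
        return np.where(x > 0)[0].item(0) + 1
    except IndexError:
        # No visit with display > 0 for this user
        return 0

# Hypothetical data: user 1 sees the page on the third visit,
# user 2 on the first visit, user 3 never.
df = pd.DataFrame(
    {'user': [1, 1, 1, 2, 3, 3], 'display': [0, 0, 2, 1, 0, 0]}
)

counts = df.groupby('user')['display'].apply(nvisits_before_display)
print(counts)
```

With this data, user 1 gets 3, user 2 gets 1, and user 3 gets 0. Since display can be any non-negative integer, the x > 0 test is what keeps this correct when a page is shown more than once in a visit.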

Upvotes: 2
