elixir

Reputation: 193

In a pandas column, how to find the max number of consecutive rows that a particular value occurs?

Let's say we have the following df with the column names.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'names': ['Alan', 'Alan', 'John', 'John', 'Alan', 'Alan', 'Alan', np.nan, np.nan, np.nan, np.nan, np.nan, 'Christy', 'Christy', 'John']})
>>> df
      names
0      Alan
1      Alan
2      John
3      John
4      Alan
5      Alan
6      Alan
7       NaN
8       NaN
9       NaN
10      NaN
11      NaN
12  Christy
13  Christy
14     John

I would like to run an apply function on the column which returns the maximum number of consecutive times a particular value occurs. At first I would like to do this for NaN, but by extension I would like to be able to switch to any other value in the column.

Explanation: If we run the apply for NaN, the result would be 5, since 5 is the largest number of times NaN occurs consecutively. If NaN later occurred consecutively more than 5 times after other values in the column, then that larger run would be the result.

If we run the apply for Alan, the result would be 3, as the run of 3 supersedes the run of 2 in the first occurrence of consecutive Alans.

Upvotes: 2

Views: 1687

Answers (3)

kennyvh

Reputation: 2854

Here's a solution you can use with groupby:

# convert nans to str
df["names"] = df["names"].fillna("NaN")

# assign a subgroup to each set of consecutive rows
df["subgroup"] = df["names"].ne(df["names"].shift()).cumsum()

# take the max length of any subgroup that belongs to "name"
def get_max_consecutive(name):
    return df.groupby(["names", "subgroup"]).apply(len)[name].max()

for name in df.names.unique():
    print(f"{name}: {get_max_consecutive(name)}")

Output:

Alan: 3
John: 2
NaN: 5
Christy: 2

Explanation:

pandas.Series.ne compares two series element-wise and returns a new series that is True where the elements in a given row are not equal and False where they are equal.

We can compare df["names"] against itself shifted down by one row (df["names"].shift()). This returns True whenever the name changes from the previous row.

So this gives us a boolean series where each True marks a change in name:

df["names"].ne(df["names"].shift())

0      True
1     False
2      True
3     False
4      True
5     False
6     False
7      True
8     False
9     False
10    False
11    False
12     True
13    False
14     True
Name: names, dtype: bool

Then, .cumsum is just a cumulative sum of this series, where True counts as 1 and False as 0. This effectively gives us a new number each time the name changes from the previous value. We assign this to its own column, subgroup, so we can use groupby with it later.

df.names.ne(df.names.shift()).cumsum()

0     1
1     1
2     2
3     2
4     3
5     3
6     3
7     4
8     4
9     4
10    4
11    4
12    5
13    5
14    6
Name: names, dtype: int64

Lastly, we can use .groupby to group the dataframe with a multi-index on the "names" and "subgroup" columns. Now we can apply the len function to get the length of each subgroup.

df.groupby(["names", "subgroup"]).apply(len)

names    subgroup
Alan     1           2
         3           3
Christy  5           2
John     2           2
         6           1
NaN      4           5
dtype: int64
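
The get_max_consecutive function above simply indexes this series by a single name and takes the max of that name's run lengths. For example, calling the series runs (a name introduced here just for illustration):

runs = df.groupby(["names", "subgroup"]).apply(len)
runs["NaN"]        # subgroup 4 -> 5
runs["NaN"].max()  # 5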

Bonus: You can turn the series returned by .apply into a dataframe using .reset_index if you'd like to see the len of each name and subgroup:

df_count = df.groupby(["names", "subgroup"]).apply(len).reset_index(name="len")
df_count

Output:

     names  subgroup  len
0     Alan         1    2
1     Alan         3    3
2  Christy         5    2
3     John         2    2
4     John         6    1
5      NaN         4    5
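
If you want the answer for every name at once, one option (not part of the original answer, but it follows directly from df_count) is to take the maximum len per name:

df_count.groupby("names")["len"].max()

names
Alan       3
Christy    2
John       2
NaN        5
Name: len, dtype: int64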

Upvotes: 2

RJ Adriaansen

Reputation: 9619

df_counts = df.copy()  # work on a copy to keep the original df intact
df_counts['names'] = df_counts['names'].fillna("NaN")  # replace np.nan with a string
# count the length of each run of consecutive identical names
df_counts['counts'] = df_counts.names.groupby((df_counts.names != df_counts.names.shift()).cumsum()).transform('size')
# keep only the longest run per name
df_counts = df_counts.sort_values('counts').drop_duplicates("names", keep='last')

def get_counts(name):
    return df_counts.loc[df_counts['names'] == name, 'counts'].item()

Then get_counts("Alan") will return 3, and get_counts("NaN") will return 5.
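
For example, with the sample df from the question:

print(get_counts("Alan"))  # 3
print(get_counts("NaN"))   # 5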

Upvotes: 5

Alex G

Reputation: 703

Since np.nan == np.nan is False, you have to check whether the provided value is NaN before counting. To get runs of consecutive elements you can use itertools' groupby.

from itertools import groupby

def max_consecutives(value):
    # NaN needs a special equality check because np.nan == np.nan is False
    if pd.isna(value):
        value_equals = lambda x: pd.isna(x)
    else:
        value_equals = lambda x: x == value

    def max_consecutive_values(col):
        # count the matching elements in each run of consecutive equal values
        elements_per_group_counter = (
            sum(1 for elem in group if value_equals(elem))
            for _, group in groupby(col)
        )
        return max(elements_per_group_counter)
    return max_consecutive_values

df.apply(max_consecutives(np.nan)) # returns 5

df.apply(max_consecutives("Alan")) # returns 3

Upvotes: 1
