Reputation: 193
Let's say we have the following df with the column names.
df = pd.DataFrame({
'names':['Alan', 'Alan', 'John', 'John', 'Alan', 'Alan','Alan', np.nan, np.nan, np.nan, np.nan, np.nan, 'Christy', 'Christy','John']})
>>> df
names
0 Alan
1 Alan
2 John
3 John
4 Alan
5 Alan
6 Alan
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 Christy
13 Christy
14 John
I would like to run a apply function on the column which returns the max consecutive times a particular values occurs. At first, I would like to do this for NaN but by extension would like to switch to any other value in the column.
Explanation: If we run the apply for Nan, the result would be 5, as 5 is the highest times NaN occurs consecutively. If there were subsequent rows after other values in the column and then NaN occurs consecutively gt than 5 times, then that would be the result.
If we run the apply for Alan, the result would be 3 as 3 would superseed the 2 in the first occurrence of consecutive Alan's.
Upvotes: 2
Views: 1687
Reputation: 2854
Here's a solution you can use with groupby
:
# convert nans to str
df["names"] = df["names"].fillna("NaN")
# assign a subgroup to each set of consecutive rows
df["subgroup"] = df["names"].ne(df["names"].shift()).cumsum()
# take the max length of any subgroup that belongs to "name"
def get_max_consecutive(name):
return df.groupby(["names", "subgroup"]).apply(len)[name].max()
for name in df.names.unique():
print(f"{name}: {get_max_consecutive(name)}")
Output:
Alan: 3
John: 2
NaN: 5
Christy: 2
Explanation:
pandas.Series.ne
takes two series and returns a new series that is True for the elements in each row are not equal and False if they are equal.
We can use df["names"]
and compare it to itself, except shifted by 1 (df["names"].shift()
). This will return True whenever name changes from the previous value.
So this gives us a boolean series where each True
marks a change in name:
df["names"].ne(df["names"].shift())
0 True
1 False
2 True
3 False
4 True
5 False
6 False
7 True
8 False
9 False
10 False
11 False
12 True
13 False
14 True
Name: names, dtype: bool
Then, .cumsum
is just a cumulative sum of this series. In this case, True is equal to 1 and False is 0. This effectively gives us a new number each time the name changes from the previous value. We can assign this to its own column subgroup
so we use groupby with it later.
df.names.ne(df.names.shift()).cumsum()
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 4
10 4
11 4
12 5
13 5
14 6
Name: names, dtype: int64
Lastly, we can use .groupby
to group the dataframe using a multi-index on the "names" and "subgroups" columns. Now we can apply the len
function to get the length of each subgroup.
df.groupby(["names", "subgroup"]).apply(len)
names subgroup
Alan 1 2
3 3
Christy 5 2
John 2 2
6 1
NaN 4 5
dtype: int64
Bonus: You can turn the series returned by .apply
into a dataframe using .reset_index
if you'd like to see the len of each name and subgroup:
df_count = df.groupby(["names", "subgroup"]).apply(len).reset_index(name="len")
df_count
Output:
names subgroup len
0 Alan 1 2
1 Alan 3 3
2 Christy 5 2
3 John 2 2
4 John 6 1
5 NaN 4 5
Upvotes: 2
Reputation: 9619
df_counts = df #create new df to keep the original
df_counts['names'].fillna("NaN", inplace=True) # replace np.nan with string
df_counts['counts'] = df.names.groupby((df.names != df.names.shift()).cumsum()).transform('size') # count consecutive names
df_counts = df_counts.sort_values('counts').drop_duplicates("names",keep='last') #keep only the highest counts
def get_counts(name):
return df_counts.loc[df['names'] == name, 'counts'].item()
Then get_counts("Alan")
will return 3
, and get_counts("NaN")
will return 5
.
Upvotes: 5
Reputation: 703
Since np.nan == np.nan
is False, you have to check if the provided value is NaN before counting. For getting consecutive elements you can use itertools' groupby
.
def max_consecutives(value):
if pd.isna(value):
value_equals = lambda x: pd.isna(x)
else:
value_equals = lambda x: x == value
def max_consecutive_values(col):
elements_per_group_counter = (
sum(1 for elem in group if value_equals(elem))
for _, group in groupby(col)
)
return max(elements_per_group_counter)
return max_consecutive_values
df.apply(max_consecutives(np.nan)) # returns 5
df.apply(max_consecutives("Alan")) # returns 3
Upvotes: 1