Reputation: 333
I ultimately want to count the number of months in a given range per each user. For example, see below, user 1 has 1 range of data from April 2021-June 2021. Where I'm struggling is counting users that multiple ranges (see users 3 & 4).
I have a pandas df w/ columns that looks like these:
username Jan_2021 Feb_2021 March_2021 April_2021 May_2021 June_2021 July_2021 Sum_of_Months
user 1 0 0 0 1 1 1 0 3
user 2 0 0 0 0 0 0 1 1
user 3 1 1 1 0 1 1 0 5
user 4 0 1 1 1 0 1 1 5
Id like to be able to get a summary column that says the number of groups and their count. For example: When I say num of groups I mean the amount of grouped 1's together. and when I say length of group I mean the amount of months in 1 group, like if I were to draw a circle around the 1s. For example, user 1 is 3 because there's a 1 in columns April-June 2021
username Num_of_groups Lenth_of_group
user 1 1 3
user 2 1 1
user 3 2 3,2
user 4 2 3,2
Upvotes: 0
Views: 53
Reputation: 1986
You can try with groupby function from itertools
from itertools import groupby
df1 = df[[col for col in df.columns if "2021" in col]]
df["Lenth_of_group"] = df1.apply(lambda x: [sum(g) for i, g in groupby(x) if i == 1],axis=1)
df["Num_of_groups"] = df["Lenth_of_group"].apply(lambda x: len(x))
Hope this Helps...
Upvotes: 2
Reputation: 2261
This solution uses staircase
, and works by treating each users data as a step function (of 1s and 0s)
setup
import pandas as pd
df = pd.DataFrame(
{
"username":["user 1", "user 2", "user 3", "user 4"],
"Jan_2021":[0,0,1,0],
"Feb_2021":[0,0,1,0],
"Mar_2021":[0,0,1,1],
"April_2021":[1,0,0,1],
"May_2021":[1,0,1,0],
"June_2021":[1,0,1,1],
"July_2021":[0,1,1,0],
"Sum_of_Months":[3,1,5,5],
}
)
solution
import staircase as sc
# trim down to month columns only, and transpose to make users correspond to columns, and months correspond to rows
data = df[["Jan_2021", "Feb_2021", "Mar_2021", "April_2021", "May_2021", "June_2021", "July_2021"]].transpose().reset_index(drop=True)
def extract_groups(series):
return (
sc.Stairs.from_values(initial_value=0, values=series) # create step function for each user
.clip(0, len(series)) # clip step function to region of interest
.to_frame() # represent data as start/stop intervals in a dataframe
.query("value==1") # filter for groups of 1s
.eval("dist=end-start") # calculate the length of each "group"
["dist"].to_list() # convert the result from Series to list
)
sfs = data.columns.to_series().apply(lambda c: extract_groups(data[c]))
sfs
is a pandas.Series
where the values are lists representing number of groups and the lengths of each. It looks like this:
0 [3]
1 [1]
2 [3, 3]
3 [2, 1]
dtype: object
You can use it to create the data you need, eg
df["Num_of_groups"] = sfs.apply(list.__len__)
adds the Num_of_groups
column to your original dataframe
Disclaimer: I am author of staircase
Upvotes: 1