Reputation: 635
I have a DataFrame which have a column with statuses like this:
datetime | session | try | status
2020-09-17 10:00:01 | '1a' | '1a_1' | 'success'
2020-09-17 10:00:02 | '2a' | '2a_1' | 'fail'
2020-09-17 10:00:03 | '2a' | '2a_2' | 'success'
2020-09-17 10:00:03 | '3a' | '3a_1' | 'interrupted'
2020-09-18 13:00:04 | '4a' | '4a_1' | 'fail'
I want to resample data by day with counting status types by condition in sessions (not tries).
I can resample it by tries easyly like this:
df['date'] = df['datetime'].dt.date
df['ones'] = np.ones(df.shape[0])
piv = df.pivot_table(index='date', columns='status', values='ones', aggfunc=len).fillna(0)
And have:
day | success | fail | interrupted
2020-09-17 | 2 | 2 | 1
2020-09-18 | 0 | 1 | 0
But I want to aggregate it by session with condition no matter how much tries in session.:
So I shoul get something like this:
day | success | fail | interrupted
2020-09-17 | 2 | 0 | 1
2020-09-18 | 0 | 1 | 0
I stuck with function and all I come up with ends with "ValueError: The truth value of a Series is ambiguous". I will be very greatfull for any ideas.
Upvotes: 0
Views: 752
Reputation: 2348
pandas
has quite a lot of different ways to aggregate information. One option here would be to use a groupby
. I like this approach as it's quite robust - another answer outlines a very elegant approach using crosstab
but I wonder how extensible that would be.
NB your description of what you want doesn't seem to match the example output - you mentioned that you wanted to aggregate by session, rather than date. It is easy to do either of these with this approach.
def count_successes(x):
s = list(x)
return len([i for i in s if i == "success"])
def count_interrupteds(x):
s = list(x)
if "success" in s:
return 0
return len([i for i in s if i == "interrupted"])
def count_failures(x):
s = list(x)
if "success" in s:
return 0
if "interrupted" in s:
return 0
return len([i for i in s if i == "fail"])
df["date"] = df.datetime.dt.date
result = (
df.groupby("date", as_index=False)
.status
.agg({
"success": count_successes,
"fail": count_failures,
"interrupted": count_interrupteds,
})
)
This produces your example output. You can see how the functions I've defined to generate each column are quite simple but could become arbitrarily complex.
To group by session each day instead, use groupby(["date", "session"]
in place of groupby("date"
.
Upvotes: 1
Reputation: 862851
My idea is convert values of statust to ordered categories, sorting and get row only with most important value which is defined in list passed to parameter categories
:
print (df)
datetime session try status
0 2020-09-17 10:00:01 1a 1a_1 success
1 2020-09-17 10:00:02 2a 2a_1 fail
2 2020-09-17 10:00:03 2a 2a_2 success
3 2020-09-17 10:00:03 3a 3a_1 interrupted
4 2020-09-18 13:00:04 4a 4a_1 fail
5 2020-09-19 10:00:01 1a 1a_1 interrupted
6 2020-09-19 10:00:02 1a 2a_1 fail
7 2020-09-19 10:00:03 2a 2a_2 success
8 2020-09-19 10:00:03 2a 3a_1 interrupted
df['status'] = pd.Categorical(df['status'],
ordered=True,
categories=['success','interrupted','fail'])
df['date'] = df['datetime'].dt.date
df1 = df.sort_values(['date','status']).drop_duplicates(['date','session'])
print (df1)
datetime session try status date
0 2020-09-17 10:00:01 1a 1a_1 success 2020-09-17
2 2020-09-17 10:00:03 2a 2a_2 success 2020-09-17
3 2020-09-17 10:00:03 3a 3a_1 interrupted 2020-09-17
4 2020-09-18 13:00:04 4a 4a_1 fail 2020-09-18
7 2020-09-19 10:00:03 2a 2a_2 success 2020-09-19
5 2020-09-19 10:00:01 1a 1a_1 interrupted 2020-09-19
piv = pd.crosstab(df1['date'], df1['status'])
print (piv)
status success interrupted fail
date
2020-09-17 2 1 0
2020-09-18 0 0 1
2020-09-19 1 1 0
Upvotes: 2