Reputation: 615
I have a dataframe:
topic student level week
1 a 1 1
1 b 2 1
1 a 3 1
2 a 1 2
2 b 2 2
2 a 3 2
2 b 4 2
3 c 1 2
3 b 2 2
3 c 3 2
3 a 4 2
3 b 5 2
I would like to extract a number of messages of student in the topic and to create a new df containing three columns:
student topic messages
a 1 2
a 2 2
a 3 1
b 1 1
b 2 2
b 3 2
c 3 2
I would like to skip rows having 0 messages.
Does anyone have some suggestion?
Thank you!
Upvotes: 2
Views: 222
Reputation: 294278
Thinking outside the box
Using Counter
import pandas as pd
from collections import Counter
s = pd.Series(Counter(zip(df.student, df.topic)), name='messages')
s.rename_axis(['student', 'topic']).reset_index()
student topic messages
0 a 1 2
1 a 2 2
2 a 3 1
3 b 1 1
4 b 2 2
5 b 3 2
6 c 3 2
Using pd.value_counts
s = pd.value_counts(list(zip(df.student, df.topic)))
pd.DataFrame(
np.column_stack([s.index.tolist(), s.values]),
columns=['student', 'topic', 'messages'])
student topic messages
0 a 1 2
1 a 2 2
2 a 3 1
3 b 1 1
4 b 2 2
5 b 3 2
6 c 3 2
Upvotes: 3
Reputation: 210842
you can do it this way:
In [132]: df.groupby(['student','topic']).size().to_frame('messages').reset_index()
Out[132]:
student topic messages
0 a 1 2
1 a 2 2
2 a 3 1
3 b 1 1
4 b 2 2
5 b 3 2
6 c 3 2
Timing:
In [208]: df = pd.concat([df] * 10**4, ignore_index=True)
In [209]: df.shape
Out[209]: (120000, 4)
In [210]: %timeit df.groupby(['student','topic']).size().to_frame('messages').reset_index()
10 loops, best of 3: 32.6 ms per loop
In [211]: %timeit df.groupby(['student','topic']).size().reset_index(name='messages')
10 loops, best of 3: 32.4 ms per loop
In [212]: from collections import Counter
In [213]: %%timeit
...: s = pd.Series(Counter(zip(df.student, df.topic)), name='messages')
...: s.rename_axis(['student', 'topic']).reset_index()
...:
10 loops, best of 3: 90.3 ms per loop
In [214]: %%timeit
...: s = pd.value_counts(list(zip(df.student, df.topic)))
...: pd.DataFrame(
...: np.column_stack([s.index.tolist(), s.values]),
...: columns=['student', 'topic', 'messages'])
...:
10 loops, best of 3: 83.4 ms per loop
Upvotes: 3
Reputation: 862681
You can use groupby
+ size
+ reset_index
:
df = df.groupby(['student','topic']).size().reset_index(name='messages')
print (df)
student topic messages
0 a 1 2
1 a 2 2
2 a 3 1
3 b 1 1
4 b 2 2
5 b 3 2
6 c 3 2
Upvotes: 4