Sheron
Sheron

Reputation: 615

Python extract a new dataframe

I have a dataframe:

  topic  student level week
    1      a       1     1
    1      b       2     1
    1      a       3     1
    2      a       1     2
    2      b       2     2
    2      a       3     2
    2      b       4     2
    3      c       1     2
    3      b       2     2
    3      c       3     2
    3      a       4     2
    3      b       5     2

I would like to extract a number of messages of student in the topic and to create a new df containing three columns:

  student  topic  messages
     a       1        2
     a       2        2
     a       3        1
     b       1        1
     b       2        2
     b       3        2
     c       3        2

I would like to skip rows having 0 messages.

Does anyone have some suggestion?

Thank you!

Upvotes: 2

Views: 222

Answers (3)

piRSquared
piRSquared

Reputation: 294278

Thinking outside the box

Using Counter

import pandas as pd
from collections import Counter

s = pd.Series(Counter(zip(df.student, df.topic)), name='messages')
s.rename_axis(['student', 'topic']).reset_index()

  student  topic  messages
0       a      1         2
1       a      2         2
2       a      3         1
3       b      1         1
4       b      2         2
5       b      3         2
6       c      3         2

Using pd.value_counts

s = pd.value_counts(list(zip(df.student, df.topic)))

pd.DataFrame(
    np.column_stack([s.index.tolist(), s.values]),
    columns=['student', 'topic', 'messages'])

  student  topic  messages
0       a      1         2
1       a      2         2
2       a      3         1
3       b      1         1
4       b      2         2
5       b      3         2
6       c      3         2

Upvotes: 3

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210842

you can do it this way:

In [132]: df.groupby(['student','topic']).size().to_frame('messages').reset_index()
Out[132]:
  student  topic  messages
0       a      1         2
1       a      2         2
2       a      3         1
3       b      1         1
4       b      2         2
5       b      3         2
6       c      3         2

Timing:

In [208]: df = pd.concat([df] * 10**4, ignore_index=True)

In [209]: df.shape
Out[209]: (120000, 4)

In [210]: %timeit df.groupby(['student','topic']).size().to_frame('messages').reset_index()
10 loops, best of 3: 32.6 ms per loop

In [211]: %timeit df.groupby(['student','topic']).size().reset_index(name='messages')
10 loops, best of 3: 32.4 ms per loop

In [212]: from collections import Counter

In [213]: %%timeit
     ...: s = pd.Series(Counter(zip(df.student, df.topic)), name='messages')
     ...: s.rename_axis(['student', 'topic']).reset_index()
     ...:
10 loops, best of 3: 90.3 ms per loop

In [214]: %%timeit
     ...: s = pd.value_counts(list(zip(df.student, df.topic)))
     ...: pd.DataFrame(
     ...:     np.column_stack([s.index.tolist(), s.values]),
     ...:     columns=['student', 'topic', 'messages'])
     ...:
10 loops, best of 3: 83.4 ms per loop

Upvotes: 3

jezrael
jezrael

Reputation: 862681

You can use groupby + size + reset_index:

df = df.groupby(['student','topic']).size().reset_index(name='messages')
print (df)
  student  topic  messages
0       a      1         2
1       a      2         2
2       a      3         1
3       b      1         1
4       b      2         2
5       b      3         2
6       c      3         2

Upvotes: 4

Related Questions