Reputation: 19
I have a DataFrame like this:
uci_class  doc_id  sentence_id  token
1          1       1            Emmanuel Macron
1          1       1            est
1          1       2            president
1          1       2            de
1          1       1            Emmanuel Macron
1          1       2            aussi
1          1       2            president
I want this output:
uci_class  doc_id  sentence_id  count
1          1       1            2
1          1       2            2
1          2       1            1
1          2       2            2
For example, the first row has count = 2 because grouping by (uci_class, doc_id, sentence_id) gives two rows with uci_class = 1, doc_id = 1 and sentence_id = 1.
That is what I want to do: a group by that counts the rows in each group.
Upvotes: 1
Views: 45
Reputation: 372
Sure, just use the .groupby method, which is described in the pandas documentation.
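import pandas as pd

# Sample data copied from the question (ids stored as strings)
df = pd.DataFrame({
    'uci_class': ['1','1','1','1','1','1','1'],
    'doc_id': ['1','1','1','1','1','1','1'],
    'sentence_id': ['1','1','2','2','1','2','2'],
    'token': ['Emmanuel Macron', 'est', 'president', 'de', 'Emmanuel Macron','aussi','president']
})

# Count the non-null tokens in each (uci_class, doc_id, sentence_id) group;
# the resulting count column keeps the name 'token'
df_grouped = df.groupby(['uci_class','doc_id','sentence_id']).count().reset_index()
print(df_grouped)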
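If you want the result column to be called count, as in your expected output, here is a minimal sketch of one way to do it with .size(). Note one assumption: your expected output contains doc_id = 2, which the sample input as posted does not, so I am guessing that the last three rows were meant to belong to doc_id 2. The ids are also stored as integers here rather than strings.

```python
import pandas as pd

# Assumption: the last three rows belong to doc_id 2. The expected output in
# the question implies this; the sample input as posted shows doc_id 1 for all rows.
df = pd.DataFrame({
    'uci_class':   [1, 1, 1, 1, 1, 1, 1],
    'doc_id':      [1, 1, 1, 1, 2, 2, 2],
    'sentence_id': [1, 1, 2, 2, 1, 2, 2],
    'token': ['Emmanuel Macron', 'est', 'president', 'de',
              'Emmanuel Macron', 'aussi', 'president'],
})

# size() counts the rows in each group; reset_index(name=...) turns the
# group keys back into columns and names the new column 'count'.
counts = (
    df.groupby(['uci_class', 'doc_id', 'sentence_id'])
      .size()
      .reset_index(name='count')
)
print(counts)
```

Under that assumption the counts come out as 2, 2, 1, 2. With the data exactly as posted (every row in doc_id 1), the same code gives only two groups, with counts 3 and 4.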
As an aside, I see that you are working on natural language processing. I recommend using a library that handles tokenization and word-based analysis more gracefully than pandas will. Check out nltk, if you haven't already. To spill the beans, my first-ever experience with Python was teaching myself how to use nltk for a project I had in college. Good luck with your work!
Upvotes: 2