Reputation: 691
This week I decided to dive a bit into pandas. I have a pandas DataFrame with historical IRC logs that looks like this:
timestamp action nick message
2005-11-04 01:44:33 False hack-cclub lex, hey!
2005-11-04 01:44:43 False hack-cclub lol, yea thats broke
2005-11-04 01:44:56 False lex Slashdot - Updated 2005-11-04 00:23:00 | Micro...
2005-11-04 01:44:56 False hack-cclub lex slashdot
2005-11-04 01:45:12 False lex port 666 is doom - doom Id Software (or mdqs o..
2005-11-04 01:45:12 False hack-cclub lex, port 666
2005-11-04 01:45:21 False hitokiri lex, port 23485
2005-11-04 01:45:45 False hitokiri lex, port 1024
2005-11-04 01:45:46 True hack-cclub slaps lex around with a wet fish
There are roughly 5.5M rows and I'm trying to make some basic visualizations like rank over time for the top 25 nicks and that sort of thing. I know I can get the top 25 nicks like this:
df['nick'].value_counts()[:25]
What I want is a rolling count like this:
hack-cclub lex hitokiri
1 0 0
2 0 0
2 1 0
3 1 0
3 2 0
4 2 0
4 2 1
4 2 2
5 2 2
So that I can plot an area graph of messages from the beginning of time for the top 25 nicks. I know I can do this by just iterating over the entire dataframe and keeping a count but since the whole point of doing this is to learn to use pandas I was hoping there would be a more idiomatic way to do it. It would also be nice to have the same data but with ranks rather than running counts like this:
hack-cclub lex hitokiri
1 2 2
1 2 2
1 2 3
1 2 3
1 2 3
1 2 3
1 2 3
1 2 2
1 2 2
Upvotes: 3
Views: 758
Reputation: 862761
IIUC you need crosstab
and cumsum
:
print df[['timestamp', 'nick']]
timestamp nick
0 2005-11-04 01:44:33 hack-cclub
1 2005-11-04 01:44:43 hack-cclub
2 2005-11-04 01:44:56 lex
3 2005-11-04 01:44:56 hack-cclub
4 2005-11-04 01:45:12 lex
5 2005-11-04 01:45:12 hack-cclub
6 2005-11-04 01:45:21 hitokiri
7 2005-11-04 01:45:45 hitokiri
8 2005-11-04 01:45:46 hack-cclub
df = pd.crosstab(df.timestamp, df.nick)
print df
nick hack-cclub hitokiri lex
timestamp
2005-11-04 01:44:33 1 0 0
2005-11-04 01:44:43 1 0 0
2005-11-04 01:44:56 1 0 1
2005-11-04 01:45:12 1 0 1
2005-11-04 01:45:21 0 1 0
2005-11-04 01:45:45 0 1 0
2005-11-04 01:45:46 1 0 0
df = df.cumsum()
print df
nick hack-cclub hitokiri lex
timestamp
2005-11-04 01:44:33 1 0 0
2005-11-04 01:44:43 2 0 0
2005-11-04 01:44:56 3 0 1
2005-11-04 01:45:12 4 0 2
2005-11-04 01:45:21 4 1 2
2005-11-04 01:45:45 4 2 2
2005-11-04 01:45:46 5 2 2
Upvotes: 2