Reputation: 4697
I'd like to find the elapsed time since the first time an event was observed. For this I saved each observation in a CSV file. Each event is identified by a unique hash.
Right now I'm doing the following:
from pandas import *
from bz2 import BZ2File
events = DataFrame.from_csv(BZ2File('events.csv.bz2', 'r'), sep='\t', header=0, index_col=None)
m = events.groupby('hash')['timestamp'].min()
at this point I have a Series indexed by the hash and the timestamp of the first observation. How would I use this to get the time offset for each row in the events DataFrame (simply timestamp - min(timestamp))?
Upvotes: 0
Views: 2660
Reputation: 16960
Let me setup a toy example here:
In [38]: rng = pd.date_range('2012-8-1', freq='T', periods=100)
In [39]: hashes = np.random.randint(0, 10, len(rng))
In [40]: obs = np.arange(len(rng))
In [41]: df = DataFrame({'hash' : hashes, 'timestamp' : rng.asobject},
....: index=obs)
Now to get the time difference for each hash:
In [42]: grouped = df.set_index('hash', append=True).groupby(level='hash')
In [44]: grouped.transform(lambda x: x-x.min())
Out[44]:
timestamp
hash
0 3 0:00:00
1 5 0:00:00
2 1 0:00:00
3 8 0:00:00
4 6 0:00:00
5 8 0:02:00
6 1 0:04:00
7 7 0:00:00
8 3 0:08:00
9 5 0:08:00
10 8 0:07:00
11 1 0:09:00
12 2 0:00:00
...
...
...
94 2 1:22:00
95 6 1:31:00
96 1 1:34:00
97 0 1:21:00
98 8 1:35:00
99 0 1:23:00
Upvotes: 2