Reputation: 1145
(newbie to python and pandas)
I have a data set of 15 to 20 million rows, each row is a time-indexed observation of a time a 'user' was seen, and I need to analyze the visit-per-day patterns of each user, normalized to their first visit. So, I'm hoping to plot with an X axis of "days after first visit" and a Y axis of "visits by this user on this day", i.e., I need to get a series indexed by a timedelta and with values of visits in the period ending with that delta [0:1, 3:5, 4:2, 6:8,] But I'm stuck very early ...
I start with something like this:
rng = pd.to_datetime(['2000-01-01 08:00', '2000-01-02 08:00',
'2000-01-01 08:15', '2000-01-02 18:00',
'2000-01-02 17:00', '2000-03-01 08:00',
'2000-03-01 08:20','2000-01-02 18:00'])
uid=Series(['u1','u2','u1','u2','u1','u2','u2','u3'])
misc=Series(['','x1','A123','1.23','','','','u3'])
df = DataFrame({'uid':uid,'misc':misc,'ts':rng})
df=df.set_index(df.ts)
grouped = df.groupby('uid')
firstseen = grouped.first()
The ts
values are unique to each uid
, but can be duplicated (two uid
can be seen at the same time, but any one uid
is seen only once at any one timestamp)
The first step is (I think) to add a new column to the DataFrame, showing for each observation what the timedelta is back to the first observation for that user. But, I'm stuck getting that column in the DataFrame. The simplest thing I tried gives me an obscure-to-newbie error message:
df['sinceseen'] = df.ts - firstseen.ts[df.uid]
...
ValueError: cannot reindex from a duplicate axis
So I tried a brute-force method:
def f(row):
return row.ts - firstseen.ts[row.uid]
df['sinceseen'] = Series([{idx:f(row)} for idx, row in df.iterrows()], dtype=timedelta)
In this attempt, df
gets a sinceseen
but it's all NaN
and shows a type of float
for type(df.sinceseen[0])
- though, if I just print the Series (in iPython) it generates a nice list of timedeltas
.
I'm working back and forth through "Python for Data Analysis" and it seems like apply()
should work, but
def fg(ugroup):
ugroup['sinceseen'] = ugroup.index - ugroup.index.min()
return ugroup
df = df.groupby('uid').apply(fg)
gives me a TypeError
on the "ugroup.index - ugroup.index.min(
" even though each of the two operands is a Timestamp
.
So, I'm flailing - can someone point me at the "pandas" way to get to the data structure Ineed?
Upvotes: 0
Views: 214
Reputation: 353019
Does this help you get started?
>>> df = DataFrame({'uid':uid,'misc':misc,'ts':rng})
>>> df = df.sort(["uid", "ts"])
>>> df["since_seen"] = df.groupby("uid")["ts"].apply(lambda x: x - x.iloc[0])
>>> df
misc ts uid since_seen
0 2000-01-01 08:00:00 u1 0 days, 00:00:00
2 A123 2000-01-01 08:15:00 u1 0 days, 00:15:00
4 2000-01-02 17:00:00 u1 1 days, 09:00:00
1 x1 2000-01-02 08:00:00 u2 0 days, 00:00:00
3 1.23 2000-01-02 18:00:00 u2 0 days, 10:00:00
5 2000-03-01 08:00:00 u2 59 days, 00:00:00
6 2000-03-01 08:20:00 u2 59 days, 00:20:00
7 u3 2000-01-02 18:00:00 u3 0 days, 00:00:00
[8 rows x 4 columns]
Upvotes: 3