adding column with per-row computed time difference from group start?

Question

(newbie to python and pandas)

I have a data set of 15 to 20 million rows, each row is a time-indexed observation of a time a 'user' was seen, and I need to analyze the visit-per-day patterns of each user, normalized to their first visit. So, I'm hoping to plot with an X axis of "days after first visit" and a Y axis of "visits by this user on this day", i.e., I need to get a series indexed by a timedelta and with values of visits in the period ending with that delta [0:1, 3:5, 4:2, 6:8,] But I'm stuck very early ...

I start with something like this:

rng = pd.to_datetime(['2000-01-01 08:00', '2000-01-02 08:00',
                  '2000-01-01 08:15', '2000-01-02 18:00',
                  '2000-01-02 17:00', '2000-03-01 08:00',
                  '2000-03-01 08:20','2000-01-02 18:00'])
uid=Series(['u1','u2','u1','u2','u1','u2','u2','u3'])
misc=Series(['','x1','A123','1.23','','','','u3'])
df = DataFrame({'uid':uid,'misc':misc,'ts':rng})
df=df.set_index(df.ts)

grouped = df.groupby('uid')
firstseen = grouped.first()

The ts values are unique to each uid, but can be duplicated (two uid can be seen at the same time, but any one uid is seen only once at any one timestamp)

The first step is (I think) to add a new column to the DataFrame, showing for each observation what the timedelta is back to the first observation for that user. But, I'm stuck getting that column in the DataFrame. The simplest thing I tried gives me an obscure-to-newbie error message:

df['sinceseen'] = df.ts - firstseen.ts[df.uid]
...
ValueError: cannot reindex from a duplicate axis

So I tried a brute-force method:

def f(row):
    return row.ts - firstseen.ts[row.uid]
df['sinceseen'] = Series([{idx:f(row)} for idx, row in df.iterrows()], dtype=timedelta)

In this attempt, df gets a sinceseen but it's all NaN and shows a type of float for type(df.sinceseen[0]) - though, if I just print the Series (in iPython) it generates a nice list of timedeltas.

I'm working back and forth through "Python for Data Analysis" and it seems like apply() should work, but

def fg(ugroup):
    ugroup['sinceseen'] = ugroup.index - ugroup.index.min()
    return ugroup

df = df.groupby('uid').apply(fg)

gives me a TypeError on the "ugroup.index - ugroup.index.min(" even though each of the two operands is a Timestamp.

So, I'm flailing - can someone point me at the "pandas" way to get to the data structure Ineed?

DSM · Accepted Answer

Does this help you get started?

>>> df = DataFrame({'uid':uid,'misc':misc,'ts':rng})
>>> df = df.sort(["uid", "ts"])
>>> df["since_seen"] = df.groupby("uid")["ts"].apply(lambda x: x - x.iloc[0])
>>> df
   misc                  ts uid        since_seen
0       2000-01-01 08:00:00  u1  0 days, 00:00:00
2  A123 2000-01-01 08:15:00  u1  0 days, 00:15:00
4       2000-01-02 17:00:00  u1  1 days, 09:00:00
1    x1 2000-01-02 08:00:00  u2  0 days, 00:00:00
3  1.23 2000-01-02 18:00:00  u2  0 days, 10:00:00
5       2000-03-01 08:00:00  u2 59 days, 00:00:00
6       2000-03-01 08:20:00  u2 59 days, 00:20:00
7    u3 2000-01-02 18:00:00  u3  0 days, 00:00:00

[8 rows x 4 columns]

adding column with per-row computed time difference from group start?

Answers (1)

Related Questions