user1893148
user1893148

Reputation: 2190

Pandas Dataframe, Apply Function, Return Index

I have a dataframe, df, with two columns: IDs and Dates. It records events for IDs at different dates. Neither field is unique, but rows are unique by combination (no ID has more than one record on the same date).

I have the following function to add a new column to determine, at a given record/date, whether or not (TRUE/FALSE) the ID has another record at any future date:

def f(df):
    count = pd.Series(np.arange(1, len(df)+1), index=df["date"])
    day = count.index.shift(0, freq="D")
    next18month = count.index.shift(3000, freq="D")
    result =  count.asof(next18month).fillna(0).values - count.asof(day).fillna(0).values
    if result[0] > 0:
        return pd.Series(1, df.index)
    else:
        return pd.Series(0, df.index)

Then I can apply the function to my dataframe, grouped by ID:

df["everagain"] = df.groupby("id").apply(f)

It doesn't work. I believe the result[0] is wrong. It works for the first time an ID appears (it counts the second time, tripping a true return), but if there is a second record for a given id, and no third record, it still returns a '1' (True) at the second record. Can someone help with the correct notation?

(Note: 3000 days is enough to count as forever given my dataset).

For example, if df looked like:

   |  ID  |  Date
0  |  A   |  2010-01-01
1  |  A   |  2010-02-01
2  |  A   |  2010-02-15
3  |  B   |  2010-01-01
4  |  C   |  2010-02-01
5  |  C   |  2010-02-15

Then output would hopefully look like:

   |  ID  |  Date        | everagain
0  |  A   |  2010-01-01  | 1
1  |  A   |  2010-02-01  | 1
2  |  A   |  2010-02-15  | 0
3  |  B   |  2010-01-01  | 0
4  |  C   |  2010-02-01  | 1
5  |  C   |  2010-02-15  | 0

Upvotes: 1

Views: 605

Answers (1)

DSM
DSM

Reputation: 353119

I originally thought I could use .groupby("ID").last() but couldn't quite get it to work. (We could do it with transform, of course, but that feels like too much firepower.)

If your data is ordered by date and has contiguous IDs, however, you can simply compare whether ID is equal to the next ID. For example:

>>> df = df.sort(["ID", "Date"])
>>> df
  ID                Date
0  A 2010-01-01 00:00:00
1  A 2010-02-01 00:00:00
2  A 2010-02-15 00:00:00
3  B 2010-01-01 00:00:00
4  C 2010-02-01 00:00:00
5  C 2010-02-15 00:00:00
>>> df["everagain"] = df["ID"] == df["ID"].shift(-1)
>>> df
  ID                Date everagain
0  A 2010-01-01 00:00:00      True
1  A 2010-02-01 00:00:00      True
2  A 2010-02-15 00:00:00     False
3  B 2010-01-01 00:00:00     False
4  C 2010-02-01 00:00:00      True
5  C 2010-02-15 00:00:00     False

If you wanted ones and zeroes instead of True and False, you could use (df["ID"] == df["ID"].shift(-1))*1) or (df["ID"] == df["ID"].shift(-1)).astype(int) to convert them.

Upvotes: 1

Related Questions