Reputation: 2190
I have a dataframe, df, with two columns: IDs and Dates. It records events for IDs at different dates. Neither field is unique, but rows are unique by combination (no ID has more than one record on the same date).
I have the following function to add a new column to determine, at a given record/date, whether or not (TRUE/FALSE) the ID has another record at any future date:
def f(df):
count = pd.Series(np.arange(1, len(df)+1), index=df["date"])
day = count.index.shift(0, freq="D")
next18month = count.index.shift(3000, freq="D")
result = count.asof(next18month).fillna(0).values - count.asof(day).fillna(0).values
if result[0] > 0:
return pd.Series(1, df.index)
else:
return pd.Series(0, df.index)
Then I can apply the function to my dataframe, grouped by ID:
df["everagain"] = df.groupby("id").apply(f)
It doesn't work. I believe the result[0]
is wrong. It works for the first time an ID appears (it counts the second time, tripping a true return), but if there is a second record for a given id, and no third record, it still returns a '1' (True) at the second record. Can someone help with the correct notation?
(Note: 3000 days is enough to count as forever given my dataset).
For example, if df looked like:
| ID | Date
0 | A | 2010-01-01
1 | A | 2010-02-01
2 | A | 2010-02-15
3 | B | 2010-01-01
4 | C | 2010-02-01
5 | C | 2010-02-15
Then output would hopefully look like:
| ID | Date | everagain
0 | A | 2010-01-01 | 1
1 | A | 2010-02-01 | 1
2 | A | 2010-02-15 | 0
3 | B | 2010-01-01 | 0
4 | C | 2010-02-01 | 1
5 | C | 2010-02-15 | 0
Upvotes: 1
Views: 605
Reputation: 353119
I originally thought I could use .groupby("ID").last()
but couldn't quite get it to work. (We could do it with transform
, of course, but that feels like too much firepower.)
If your data is ordered by date
and has contiguous IDs, however, you can simply compare whether ID
is equal to the next ID
. For example:
>>> df = df.sort(["ID", "Date"])
>>> df
ID Date
0 A 2010-01-01 00:00:00
1 A 2010-02-01 00:00:00
2 A 2010-02-15 00:00:00
3 B 2010-01-01 00:00:00
4 C 2010-02-01 00:00:00
5 C 2010-02-15 00:00:00
>>> df["everagain"] = df["ID"] == df["ID"].shift(-1)
>>> df
ID Date everagain
0 A 2010-01-01 00:00:00 True
1 A 2010-02-01 00:00:00 True
2 A 2010-02-15 00:00:00 False
3 B 2010-01-01 00:00:00 False
4 C 2010-02-01 00:00:00 True
5 C 2010-02-15 00:00:00 False
If you wanted ones and zeroes instead of True and False
, you could use (df["ID"] == df["ID"].shift(-1))*1)
or (df["ID"] == df["ID"].shift(-1)).astype(int)
to convert them.
Upvotes: 1