Reputation: 957
I have written the following code which creates two dataframes nq
and cmnt
.
nq
contains the UserId
and corresponding time of Badge Attainment date
.
cmnt
contains OwnerUserId
and the time when the User made a comment CreationDate
.
I want to get a count of the comments made for all days before and after 1 week of badge attainment so that I can create a time series line plot out of it.
The following code perform the same but produces a KeyError. Please provide a code that performs this operations for all users.
nq
UserId | date
1 2009-10-17 17:38:32.590
2 2009-10-19 00:37:23.067
3 2009-10-20 08:37:14.143
4 2009-10-21 18:07:51.247
5 2009-10-22 21:25:24.483
cmnt
OwnerUserId | CreationDate
1 2009-10-16 17:38:32.590
1 2009-10-18 17:38:32.590
2 2009-10-18 00:37:23.067
2 2009-10-17 00:37:23.067
2 2009-10-20 00:37:23.067
3 2009-10-19 08:37:14.143
4 2009-10-20 18:07:51.247
5 2009-10-21 21:25:24.483
code
nq.date = pd.to_datetime(nq.date)
cmnt.CreationDate = pd.to_datetime(cmnt.CreationDate)
count= []
for j in range(len(nq)):
for i in range(-7,8):
check_date = nq.date.iloc[j] + timedelta(days=i)
count = cmnt.loc[(cmnt.OwnerUserId == nq.UserId.iloc[j]) & (cmnt.CreationDate == check_date)].shape[0]
nq.iloc[j].append({nq[i]:count})
expected output
UserId | date |-7|-6|-5|-4|-3|-2|-1|0 |1 |2 |3 |4 |5 |6 |7
1 2009-10-17 17:38:32.590 |0 |0 |0 |0 |0 |0 |1 |0 |1 |0 |0 |0 |0 |0 |0
2 2009-10-19 00:37:23.067 |0 |0 |0 |0 |0 |1 |1 |0 |1 |0 |0 |0 |0 |0 |0
3 2009-10-20 08:37:14.143 |0 |0 |0 |0 |0 |0 |1 |0 |0 |0 |0 |0 |0 |0 |0
4 2009-10-21 18:07:51.247 |0 |0 |0 |0 |0 |0 |1 |0 |0 |0 |0 |0 |0 |0 |0
5 2009-10-22 21:25:24.483 |0 |0 |0 |0 |0 |0 |1 |0 |0 |0 |0 |0 |0 |0 |0
Here column -1
means comment made 1 day before badge attainment and 1
means comment made one day after badge attainment and so on.
Note There can be a completely alternately way to do this. My main objective is to draw a time series line plot which shows the number of comments made by the users before and after attainment of the badge.
Upvotes: 0
Views: 163
Reputation: 12503
Here's a way to do that:
t = pd.merge(nq, cmnt, left_on="UserId", right_on = "OwnerUserId")
t["days_diff"] = (t["CreationDate"] - t["date"]).dt.days
t["count"] = t.groupby(["UserId", "days_diff"]).OwnerUserId.transform("count")
all_days = pd.DataFrame(itertools.product(t.UserId.unique(), range(-7, 8)), )
all_days.columns = ["UserId", "day"]
t = pd.merge(t, all_days, left_on=["UserId", "days_diff"], right_on=["UserId", "day"], how = "right")
t = pd.pivot_table(t, index="UserId", columns="day", values="count", dropna=False)
res = pd.merge(nq, t, left_on="UserId", right_index=True)
print(res)
The output is:
UserId date -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7
0 1 2009-10-17 17:38:32.590 NaN NaN NaN NaN NaN NaN 1.0 NaN 1.0 NaN NaN NaN NaN NaN NaN
1 2 2009-10-19 00:37:23.067 NaN NaN NaN NaN NaN 1.0 1.0 NaN 1.0 NaN NaN NaN NaN NaN NaN
2 3 2009-10-20 08:37:14.143 NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
3 4 2009-10-21 18:07:51.247 NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
4 5 2009-10-22 21:25:24.483 NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
Upvotes: 1