Reputation: 19
logged_at type name values
2020-08-17 00:02:22 weak AA 55
2020-08-17 00:12:20 weak AA 54
2020-08-17 00:22:24 weak AA 53
2020-08-17 00:32:25 weak AA 50
2020-08-17 00:42:28 strong AA 44
2020-08-17 00:52:22 strong AA 33
2020-08-17 01:02:20 strong AA 32
2020-08-17 01:22:24 weak AA 56
2020-08-17 01:32:25 weak AA 55
2020-08-17 01:42:28 weak AA 43
Above is my dataframe. What I want to do is treat each run of consecutive rows with the same type as a subset, and for each subset compute the first value minus the last value.
The desired output looks like this:
logged_at type name values rank
2020-08-17 00:02:22 weak AA 5 1
2020-08-17 00:42:28 strong AA 12 2
2020-08-17 01:22:24 weak AA 13 3
Whenever the type changes, a rank value should be assigned so that each subset can be distinguished. What I have in mind is to iterate with itertuples, increment the rank whenever the type differs from the previous row's type, collect the ranks in a list, and then add that list as a new column. I don't know whether this is efficient. There are about 300 million rows, so I am looking for the most efficient method.
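For reference, the change-point rank described above can be built without an explicit itertuples loop, by comparing `type` against its shifted copy and taking a cumulative sum. A minimal sketch on the sample data from the question (column names as shown above):

```python
import pandas as pd

# Sample data mirroring the question (logged_at omitted for brevity)
df = pd.DataFrame({
    "type":   ["weak"] * 4 + ["strong"] * 3 + ["weak"] * 3,
    "name":   ["AA"] * 10,
    "values": [55, 54, 53, 50, 44, 33, 32, 56, 55, 43],
})

# Rank increments every time `type` differs from the previous row
df["rank"] = df["type"].ne(df["type"].shift()).cumsum()

# First value minus last value within each run
out = (df.groupby("rank", as_index=False)
         .agg(type=("type", "first"),
              name=("name", "first"),
              values=("values", lambda s: s.iloc[0] - s.iloc[-1])))
print(out)
```

This yields one row per run with values 5, 12 and 13, matching the desired output.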
Upvotes: 2
Views: 53
Reputation: 28644
You can break it into stages:
# get the first row of each run:
top = df.loc[df["type"].ne(df["type"].shift())]

# get the last row of each run:
bottom = df.loc[df["type"].ne(df["type"].shift(-1))]

# take the difference in values and generate the rank:
top.assign(values=top["values"].array - bottom["values"].array,
           rank=range(1, 1 + len(top)))
logged_at type name values rank
0 2020-08-17 00:02:22 weak AA 5 1
4 2020-08-17 00:42:28 strong AA 12 2
7 2020-08-17 01:22:24 weak AA 13 3
You can reset the index. However, with 300 million rows, I do not think Pandas is the right choice.
Upvotes: 1
Reputation: 2614
If the dataframe is sorted by time, just drop duplicates as follows.
df.drop_duplicates(subset='logged_at', keep='first')
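A minimal sketch of that call (note the keyword is `subset`, not `by`), on hypothetical data containing a repeated `logged_at` value:

```python
import pandas as pd

df = pd.DataFrame({
    "logged_at": ["00:02:22", "00:02:22", "00:12:20"],
    "values":    [55, 54, 53],
})

# Keep only the first row for each distinct logged_at value
deduped = df.drop_duplicates(subset="logged_at", keep="first")
print(deduped)
```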
Upvotes: 0