kang

Reputation: 19

How to apply a function by grouping by specific locations in pandas?

logged_at           type    name    values
2020-08-17 00:02:22 weak    AA      55
2020-08-17 00:12:20 weak    AA      54
2020-08-17 00:22:24 weak    AA      53
2020-08-17 00:32:25 weak    AA      50
2020-08-17 00:42:28 strong  AA      44
2020-08-17 00:52:22 strong  AA      33
2020-08-17 01:02:20 strong  AA      32
2020-08-17 01:22:24 weak    AA      56
2020-08-17 01:32:25 weak    AA      55
2020-08-17 01:42:28 weak    AA      43

Above is the dataframe. For each name, I want to treat each consecutive run of rows with the same type as a subset, and compute the first value minus the last value of each subset.

The desired result looks like this:

logged_at           type    name    values  rank
2020-08-17 00:02:22 weak    AA      5        1
2020-08-17 00:42:28 strong  AA      12       2
2020-08-17 01:22:24 weak    AA      13       3

Whenever the type changes, a rank value should be assigned so that each subset can be distinguished. What I have in mind is to iterate with itertuples, compare each row's type against the previous one, increment the rank whenever it differs, collect the ranks in a list, and add that list as a new column. I don't know if this is efficient. There are about 300 million rows, so I am looking for the most efficient method.
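The rank-assignment loop described above can also be done without Python-level iteration: comparing the type column against its shifted copy marks where each run starts, and a cumulative sum of those marks yields the rank. A minimal sketch with made-up sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["weak", "weak", "strong", "strong", "weak"],
    "values": [55, 50, 44, 32, 56],
})

# A new run starts wherever the type differs from the previous row;
# the cumulative sum of those change points is the run id / rank.
df["rank"] = df["type"].ne(df["type"].shift()).cumsum()
print(df["rank"].tolist())  # [1, 1, 2, 2, 3]
```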

Upvotes: 2

Views: 53

Answers (2)

sammywemmy

Reputation: 28644

You can break it into stages:

# get the first row of each run (type differs from the previous row):
top = df.loc[df["type"].ne(df["type"].shift())]

# get the last row of each run (type differs from the next row):
bottom = df.loc[df["type"].ne(df["type"].shift(-1))]

# get the difference in values and generate the rank:
top.assign(values=top["values"].array - bottom["values"].array,
           rank=range(1, 1 + len(top)))

    logged_at          type   name  values  rank
0   2020-08-17 00:02:22 weak    AA      5   1
4   2020-08-17 00:42:28 strong  AA      12  2
7   2020-08-17 01:22:24 weak    AA      13  3

You can reset the index. However, with 300 million rows, I do not think Pandas is the right choice.
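For comparison, the same result can be produced in a single groupby pass over run ids (the shift/cumsum idiom for labelling consecutive runs). This is a sketch using the question's sample data, not a claim about which approach is faster at 300 million rows:

```python
import pandas as pd

df = pd.DataFrame({
    "logged_at": pd.to_datetime([
        "2020-08-17 00:02:22", "2020-08-17 00:12:20", "2020-08-17 00:22:24",
        "2020-08-17 00:32:25", "2020-08-17 00:42:28", "2020-08-17 00:52:22",
        "2020-08-17 01:02:20", "2020-08-17 01:22:24", "2020-08-17 01:32:25",
        "2020-08-17 01:42:28",
    ]),
    "type": ["weak"] * 4 + ["strong"] * 3 + ["weak"] * 3,
    "name": ["AA"] * 10,
    "values": [55, 54, 53, 50, 44, 33, 32, 56, 55, 43],
})

# Label each consecutive run of `type` with an increasing id; that id
# is exactly the rank the question asks for.
rank = df["type"].ne(df["type"].shift()).cumsum().rename("rank")

out = (
    df.groupby(rank)
      .agg(logged_at=("logged_at", "first"),
           type=("type", "first"),
           name=("name", "first"),
           values=("values", lambda s: s.iloc[0] - s.iloc[-1]))
      .reset_index()
)
print(out)
```

The lambda computes first minus last within each run, matching the 5 / 12 / 13 in the desired output.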

Upvotes: 1

Gilseung Ahn

Reputation: 2614

If the dataframe is sorted by time, just remove duplicates as follows.

df.drop_duplicates(subset='logged_at', keep='first')

Upvotes: 0
