Reputation: 58
For the sake of this question I simplified the example, but it comes down to the following:
I have a pandas DataFrame containing logged items. The index holds the id of each item, and there are two columns: when (when the item was logged) and logger_id (who logged the item). I want to create a new table that defines the periods between consecutive logged items, for each unique logger.
The new table gets an index of fresh ids and two columns: item_start, the id of the log item at the start of the period, and item_finish, the id of the log item at the end of the period.
The code below works, but I do not like the nested for loop. I know pandas has a lot of built-in functionality for avoiding for loops, but I was not able to find the right one for this case.
import uuid
from datetime import datetime
import pandas as pd
df = pd.DataFrame({
"log_item_id": [str(uuid.uuid4()) for _ in range(6)],
"when": [datetime(2021, 4, d, 12) for d in [1, 8, 15]] * 2,
"logger_id": [str(uuid.uuid4()), str(uuid.uuid4())] * 3,
}).set_index("log_item_id")
periods = pd.DataFrame(columns=["item_start", "item_finish"])
for logger_id in df.logger_id.unique():  # loop one
    filtered_df = df[df.logger_id == logger_id]
    for ix, log_id in enumerate(filtered_df.index[1:]):  # loop two
        previous = filtered_df.index[ix]
        index = str(uuid.uuid4())
        periods.loc[index, "item_start"] = previous
        periods.loc[index, "item_finish"] = log_id
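(For reference, one incremental improvement short of full vectorization is to drop the inner loop by pairing each group's index with itself shifted by one. A minimal sketch with made-up toy data, not the original uuid-keyed frame:)

```python
import uuid
import pandas as pd

# Toy frame: item ids in the index, one logger per row (made-up example data)
df = pd.DataFrame(
    {"logger_id": ["a", "b", "a", "b", "a", "b"]},
    index=[f"item{i}" for i in range(6)],
)

rows = []
for _, group in df.groupby("logger_id"):
    idx = group.index
    # Pair every item with its successor within the same logger
    rows += zip(idx[:-1], idx[1:])

periods = pd.DataFrame(
    rows,
    columns=["item_start", "item_finish"],
    index=[str(uuid.uuid4()) for _ in rows],
)
```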
Upvotes: 1
Views: 93
Reputation: 41337
If I understand correctly, you can groupby() the logger_id and shift() the log_item_id. Then just do some cleanup of the columns/names/index:
df = df.reset_index()
df['item_finish'] = df.groupby('logger_id').log_item_id.shift(-1)
# cleanup
df = (df.dropna()
        .rename(columns={'log_item_id': 'item_start'})
        .sort_values(by='logger_id')
        .drop(columns=['logger_id', 'when']))
df.index = [uuid.uuid4() for _ in range(len(df))]
Input:
log_item_id when logger_id
4c6175fd-0258-492f-bc40-f6a02c5a8103 2021-04-01 12:00:00 f3534047-ee63-40ef-982a-27ca22d36ac0
1433966c-5643-44f3-ba11-ca91cc9867ee 2021-04-08 12:00:00 83399790-394b-4b63-b129-99f7fdb77353
d4848bdf-7e4a-4596-808f-ee5758b71bdd 2021-04-15 12:00:00 f3534047-ee63-40ef-982a-27ca22d36ac0
d9a6ecd6-9a5c-4d4b-9af8-db7bef262ee2 2021-04-01 12:00:00 83399790-394b-4b63-b129-99f7fdb77353
8eb1b911-0cad-42ad-a1f5-a7b46fc03b91 2021-04-08 12:00:00 f3534047-ee63-40ef-982a-27ca22d36ac0
291e333a-c01a-4210-961f-c2ea4c1ef0b9 2021-04-15 12:00:00 83399790-394b-4b63-b129-99f7fdb77353
Output:
item_start item_finish
8ce599d0-4d03-489a-84e3-f417e908b9ab 1433966c-5643-44f3-ba11-ca91cc9867ee d9a6ecd6-9a5c-4d4b-9af8-db7bef262ee2
805020e8-39ab-4113-8b3b-8c74997eb108 d9a6ecd6-9a5c-4d4b-9af8-db7bef262ee2 291e333a-c01a-4210-961f-c2ea4c1ef0b9
a7f8b527-268f-4de6-ba8a-23f9671046c4 4c6175fd-0258-492f-bc40-f6a02c5a8103 d4848bdf-7e4a-4596-808f-ee5758b71bdd
a0a7a765-1f41-4eca-80fd-8d79f32058c7 d4848bdf-7e4a-4596-808f-ee5758b71bdd 8eb1b911-0cad-42ad-a1f5-a7b46fc03b91
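To see why the dropna() step works, note that shift(-1) within each group leaves NaN on the last item per logger, since it has no successor. A minimal self-contained sketch of the same idea, using made-up short ids instead of uuids for readability:

```python
import uuid
import pandas as pd

# Toy data standing in for the original frame (ids and loggers are made up)
df = pd.DataFrame({
    "log_item_id": [f"item{i}" for i in range(6)],
    "logger_id": ["a", "b"] * 3,
})

# shift(-1) within each logger pairs every item with its successor;
# the last item per logger has no successor, so it gets NaN
df["item_finish"] = df.groupby("logger_id").log_item_id.shift(-1)

periods = (
    df.dropna(subset=["item_finish"])
      .rename(columns={"log_item_id": "item_start"})
      .drop(columns="logger_id")
)
periods.index = [str(uuid.uuid4()) for _ in range(len(periods))]
```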
Benchmarking both approaches, wrapped as groupby_shift() (this answer) and double_loop() (the question's code), on a larger frame:
>>> df = pd.concat([df] * 1000) # 6000 rows
>>> %timeit groupby_shift(df)
35.2 ms ± 764 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit double_loop(df)
5.26 s ± 441 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
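The timed groupby_shift and double_loop functions are not shown; a sketch of what such wrappers might look like, reconstructed from the question's loop and this answer's code (the function bodies are an assumption):

```python
import uuid
import pandas as pd

def groupby_shift(df):
    # Vectorized approach: shift log_item_id by -1 within each logger
    out = df.reset_index()
    out["item_finish"] = out.groupby("logger_id").log_item_id.shift(-1)
    out = (out.dropna()
              .rename(columns={"log_item_id": "item_start"})
              .sort_values(by="logger_id")
              .drop(columns=["logger_id", "when"]))
    out.index = [uuid.uuid4() for _ in range(len(out))]
    return out

def double_loop(df):
    # Nested-loop approach from the question, unchanged
    periods = pd.DataFrame(columns=["item_start", "item_finish"])
    for logger_id in df.logger_id.unique():
        filtered_df = df[df.logger_id == logger_id]
        for ix, log_id in enumerate(filtered_df.index[1:]):
            index = str(uuid.uuid4())
            periods.loc[index, "item_start"] = filtered_df.index[ix]
            periods.loc[index, "item_finish"] = log_id
    return periods
```

Both return the same (item_start, item_finish) pairs, differing only in row order and index values.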
Upvotes: 1