Remove double for loop for pandas

Question

For the sake of this question I simplified the example, but it comes down to the following:

I have a pandas DataFrame that contains logged items. The index is the id for each specific item and there are two columns when (when was the item logged) and logger_id (who logged the item). I want to create a new table, to define the periods between each logged item, for each unique logger.

So the new table would get an index with new ids and there will be two columns. A column matching the id of the log_item at the start of the period, item_start, and a column matching the id of the log_item at the end of the period, item_finish.

The code below works, but I do not like the nested for loop. I know pandas has a lot of build-in functionality to avoid using for loops, but I was not able to find the right one for this case.

import uuid
from datetime import datetime

import pandas as pd

df = pd.DataFrame({
    "log_item_id": [str(uuid.uuid4()) for _ in range(6)],
    "when": [datetime(2021, 4, d, 12) for d in [1, 8, 15]] * 2,
    "logger_id": [str(uuid.uuid4()), str(uuid.uuid4())] * 3,
}).set_index("log_item_id")

periods = pd.DataFrame(columns = ["item_start", "item_finish"])
for logger_id in df.logger_id.unique():                               # loop one
    filtered_df = df[df.logger_id == logger_id]
    
    for ix, log_id in enumerate(filtered_df.index[1::]):              # loop two
        previous = filtered_df.index[ix]
        
        index = str(uuid.uuid4())
        periods.loc[index, "item_start"] = previous
        periods.loc[index, "item_finish"] = log_id

tdy · Accepted Answer

If I understand correctly, you can groupby() the logger_id and shift() the log_item_id. Then just do some cleanup of the columns/names/index:

df = df.reset_index()
df['item_finish'] = df.groupby('logger_id').log_item_id.shift(-1)

# cleanup
df = df.dropna().rename(columns={'log_item_id': 'item_start'}).sort_values(by='logger_id').drop(columns=['logger_id', 'when'])
df.index = [uuid.uuid4() for _ in range(len(df))]

Input:

                         log_item_id                 when                             logger_id
4c6175fd-0258-492f-bc40-f6a02c5a8103  2021-04-01 12:00:00  f3534047-ee63-40ef-982a-27ca22d36ac0
1433966c-5643-44f3-ba11-ca91cc9867ee  2021-04-08 12:00:00  83399790-394b-4b63-b129-99f7fdb77353
d4848bdf-7e4a-4596-808f-ee5758b71bdd  2021-04-15 12:00:00  f3534047-ee63-40ef-982a-27ca22d36ac0
d9a6ecd6-9a5c-4d4b-9af8-db7bef262ee2  2021-04-01 12:00:00  83399790-394b-4b63-b129-99f7fdb77353
8eb1b911-0cad-42ad-a1f5-a7b46fc03b91  2021-04-08 12:00:00  f3534047-ee63-40ef-982a-27ca22d36ac0
291e333a-c01a-4210-961f-c2ea4c1ef0b9  2021-04-15 12:00:00  83399790-394b-4b63-b129-99f7fdb77353

Output:

                                                                item_start                           item_finish
8ce599d0-4d03-489a-84e3-f417e908b9ab  1433966c-5643-44f3-ba11-ca91cc9867ee  d9a6ecd6-9a5c-4d4b-9af8-db7bef262ee2
805020e8-39ab-4113-8b3b-8c74997eb108  d9a6ecd6-9a5c-4d4b-9af8-db7bef262ee2  291e333a-c01a-4210-961f-c2ea4c1ef0b9
a7f8b527-268f-4de6-ba8a-23f9671046c4  4c6175fd-0258-492f-bc40-f6a02c5a8103  d4848bdf-7e4a-4596-808f-ee5758b71bdd
a0a7a765-1f41-4eca-80fd-8d79f32058c7  d4848bdf-7e4a-4596-808f-ee5758b71bdd  8eb1b911-0cad-42ad-a1f5-a7b46fc03b91

Timings

>>> df = pd.concat([df] * 1000) # 6000 rows

>>> %timeit groupby_shift(df)
35.2 ms ± 764 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

>>> %timeit double_loop(df)
5.26 s ± 441 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Remove double for loop for pandas

Answers (1)

Timings

Related Questions