TaborlinTheGreat
TaborlinTheGreat

Reputation: 25

Pandas iterate over a DataFrame to get the values from columns according to conditions and put them in a new DataFrame

here are some lines from my DataFrame composed as follows :

Index ['TimeStamp'] ['ThreadID'] ['Start|End'] ['StepIdentifier']
1017 18:44:22,997 [12] Start FetchMasterData
1018 18:44:22,997 [12] Start GetSmrCalculatedMeterData
1020 18:44:22,997 [12] End GetSmrCalculatedMeterData
1021 18:44:22,997 [12] Start GetSmrPhysicalMeterData
1023 18:44:23,013 [12] End GetSmrPhysicalMeterData
1024 18:44:23,013 [12] Start GetSmrMarketDeliveryPointData
1026 18:44:23,013 [12] End GetSmrMarketDeliveryPointData
1027 18:44:23,013 [12] Start GetSmrMarketHeadpointData
1029 18:44:23,013 [12] End GetSmrMarketHeadpointData
1030 18:44:23,013 [12] End FetchMasterData

I need for each Thread to get the Start and End time for each process. We can see that the "Start" and the "End" for a process (like FetchMasterData) do not necessarily follow each other.

In order to create a new DataFrame like this:

['ThreadID'] ['StepIdentifier'] ['Start'] ['End']
    [12]      FetchMasterData  18:44:22,997 18:44:23,013

I tried to compare a line that has a process start with all the following ones until I find the end of the process, then to create a line in the new dataframe :

def iterateDfStartEnd(df):
    df = createDataFrame() // Function to create the DataFrame shown above
    new_df = pd.DataFrame(columns=['ThreadID','StepIdentifier','StartTime','EndTime'])
    j = 1013 // Because i starts at 1012(like j = i + 1) and i ends at 1140
    for i in df.index
        while(df.iloc[i,'End|Start'] == "Start" & df.iloc[j,'End|Start'] == "End"
        & df.iloc[i,'StepIdentifier'] == df.iloc[j,'StepIdentifier']
        & df.iloc[i,'ThreadID'] == df.iloc[j,'ThreadID'] & j < 1141):
            j += 1
        new_df['ThreadID'] = df.iloc[i,'ThreadID']
        new_df['StepIdentifier'] = df.iloc[i,'StepIdentifier']
        new_df['StartTime'] = df.iloc[i,'TimeStamp']
        new_df['EndTime'] = df.iloc[j,'TimeStamp']
        j = i + 1
    return new_df

For finally calculate for each process the time between the operations Start and End and to have a DataFrame like this (just for your information) :

['ThreadID'] ['StepIdentifier'] ['Time']
    [12]      FetchMasterData     16s

I don't understand how to iterate the dataframe by comparing rows as I do, I have errors with the indexes. Does anyone have another way to create the new dataframe ? Thanks everyone !

Edit : Many thanks to the 2 proposed solutions but there is a problem, a "StepIdentifier" is not unique, it can appear several times in the DataFrame but with different start and end times

Like this :

['TimeStamp'] ['ThreadID'] ['Start|End'] ['StepIdentifier']
18:44:22,997 [12] Start FetchMasterData
18:44:22,997 [12] Start GetSmrCalculatedMeterData
18:44:22,997 [12] End GetSmrCalculatedMeterData
18:44:22,997 [12] Start GetSmrPhysicalMeterData
18:44:23,013 [12] End GetSmrPhysicalMeterData
18:44:23,013 [12] End FetchMasterData
18:44:23,013 [12] Start GGetSmrCalculatedMeterData
18:44:23,023 [12] End GetSmrCalculatedMeterData

We can see that the 'StepIdentifier' "GetSmrCalculatedMeterData" appears twice.

Here is a DataFrame to reproduce this display :

df = pd.DataFrame({'TimeStamp': ['18:44:22,997','18:44:22,997','18:44:22,997','18:44:22,997','18:44:23,013','18:44:23,013','18:44:23,013','18:44:23,023'],
                'ThreadID': ['[12]','[12]','[12]','[12]','[12]','[12]','[12]','[12]'],
                'Start|End': ['Start', 'Start', 'End', 'Start', 'End', 'End', 'Start', 'End'],
                'StepIdentifier': ['FetchMasterData','GetSmrCalculatedMeterData','GetSmrCalculatedMeterData','GetSmrPhysicalMeterData','GetSmrPhysicalMeterData','FetchMasterData','GetSmrCalculatedMeterData','GetSmrCalculatedMeterData']})

Upvotes: 2

Views: 328

Answers (2)

theodosis
theodosis

Reputation: 1026

If you'd like to have the same StepIdentifier for the same ThreadID, then you could try something like this:

# sorting to make sure that the start and end time of the same process that took place on the same thread are one below the other
df = df.sort_values(by=['ThreadID', 'StepIdentifier', 'Timestamp'])

# assigning an inner group counter to distinguish same processes that took place on the same thread
df['group'] = df.groupby(['ThreadID', 'StepIdentifier', 'Start|End']).cumcount()

# unstacking to create the new columns
new_df = df.set_index(['ThreadID', 'StepIdentifier', 'group', 'Start|End']).unstack()

Attaching a screenshot of results for your MRE: enter image description here

Upvotes: 2

Vincent Rupp
Vincent Rupp

Reputation: 655

There's a fairly straightforward solution if your dataframe always has exactly one Start and one End time. You do a .groupby() on the ThreadId and the StepIdentifier, and then use the minimum of TimeStamp for Start and the maximum for End:

df = pd.DataFrame({'ts': ['18','19','20','21','22','23'],
                    'id': ['12','12','12','12','12','12'],
                    'se': ['Start', 'Start', 'End', 'Start', 'End', 'End'],
                    'event': ['stomp','cramp','stomp','break','cramp','break']})
g = df.groupby(["id",'event']).agg(Start = ('ts', lambda x: min(x)), End = ('ts', lambda x: max(x)))
g.reset_index()

The output looks like:

    id  event   Start   End
0   12  break   21  23
1   12  cramp   19  22
2   12  stomp   18  20

Upvotes: 1

Related Questions