How to find the previous occurrence of a CSV row using Python

Question

I have simplified my CSV as below:

Job,Task,Operator,Comment,Time,Required
Forklift,Crating,John,Daily Task,2020/11/23 00:14:59.280,Y
Truck,Crating,Steve,Daily Task,2020/11/23 01:14:59.280,N
Truck,Storage,Brain,Daily Task,2020/11/23 02:14:59.280,Y
Forklift,Crating,John,Daily Task,2020/11/23 03:14:59.280,Y
Shop Floor,Stacking,Steve,Weekly,2020/11/23 04:14:59.280,N
Truck,Storage,Brain,Daily Task,2020/11/23 06:14:59.280,Y
Forklift,Crating,John,Daily Task,2020/11/23 08:14:59.280,Y
Forklift,Crating,John,Daily Task,2020/11/23 10:14:59.280,Y
Truck,Storage,Brain,Daily Task,2020/11/23 12:14:59.280,Y

I'm looking to get find the required tasks and the previous start time for this tasks. So in this case I would like the output to be something like:

Job         Task        Operator    Comment      Time               Previous Time
Forklift    Crating     John        Daily Task  11/23/2020 3:14     11/23/2020 0:14
Truck       Storage     Brain       Daily Task  11/23/2020 6:14     11/23/2020 2:14

So far I have my code finding the required tasks and assigning the cells to pandas DFs. But I can't figure out how to loop back through the csv to find the previous time for matching rows.

import csv
import pandas
import datetime
from datetime import datetime, timedelta

Job=[]
Task=[]
Oper=[]
Time=[]
Comment=[]

with open('Job.csv','r') as source:
    reader = csv.reader((line.replace('\0','') for line in source), delimiter=",", quotechar='"')
    next(reader, None)
    count = 0
    CurrentTime = datetime.now()
    Last24HrsTime = CurrentTime - timedelta(hours=24)
    print(Last24HrsTime)
    for row in reader:
            count += 1
            Required_Row = row[5]
            JobStartTime = row[4]
            if JobStartTime in (None, ""):
                JobStartTimeFormat = "2019/09/08 01:01:59.280" #In case there is an error and the date is blank.
            else:
                JobStartTimeFormat = datetime.strptime(JobStartTime, '%Y/%m/%d %H:%M:%S.%f')
                JobStartTimeFormatPrint = datetime.strftime(JobStartTimeFormat, '%Y/%m/%d %H:%M')
            if Required_Row == "Y" and JobStartTimeFormat >= Last24HrsTime:    #Is the job required and Job start is in the last 24hrs?
                JobID = row[0]
                TaskID = row[1]
                OperID = row[2]
                SearchRow = count + 1
                Job.append({'Job': row[1]})
                Task.append({'Task': row[2]})
                Oper.append({'Oper': row[2]})
                Time.append({'Tool': 'JobStartTimeFormatPrint'})
                Comment.append({'Comment': row[3]})

##Need to loop here to find preivous values

Current output of this is:

Forklift Crating John 2020/11/23 00:14 Daily Task
Truck Storage Brain 2020/11/23 02:14 Daily Task
Forklift Crating John 2020/11/23 03:14 Daily Task
Truck Storage Brain 2020/11/23 06:14 Daily Task
Forklift Crating John 2020/11/23 08:14 Daily Task
Forklift Crating John 2020/11/23 10:14 Daily Task
Truck Storage Brain 2020/11/23 12:14 Daily Task

Any ideas how to loop through and just find the pervious occurrence?

Gijs Wobben · Accepted Answer

I wouldn't use loops in this case, it gets a little messy. What you want is per group (e.g. task) the previous time. The previous time is defined as: given that the rows are sorted by time, the previous row. Getting the previous row can be done with .shift(1).

Full example:

import csv
import pandas

# Convert data to a list of dictionaries
with open("my_csv_file.csv", "r") as f:
    reader = csv.DictReader(f)
    data = [dict(line) for line in reader]

df = pandas.DataFrame(data)

# Parse time field
df["Time"] = pandas.to_datetime(df["Time"])

# Select only required tasks
df = df.loc[df["Required"] == "Y"]


def get_previous_time_per_group(group: pandas.DataFrame) -> pandas.DataFrame:
    """ Method to get the previous time as a column per group.

    Args:
        group (pandas.DataFrame): The input group (created by a groupby command)

    Returns:
        pandas.DataFrame: The resulting group with the previous time column
    """

    # Sort by the time column
    group = group.sort_values("Time")

    # Get the previous time by shifting the time column by 1 position
    group["Previous Time"] = group["Time"].shift(1)

    # Return the group, including the previous time
    return group


# Group the dataframe by task (or whatever you want) and per group get the previous time column
df = df.groupby("Task").apply(get_previous_time_per_group)

# Clean up the index created by the groupby
df = df.droplevel(0).sort_index()

Full example:

import io
import csv

import pandas

f = io.StringIO("""Job,Task,Operator,Comment,Time,Required
Forklift,Crating,John,Daily Task,2020/11/23 00:14:59.280,Y
Truck,Crating,Steve,Daily Task,2020/11/23 01:14:59.280,N
Truck,Storage,Brain,Daily Task,2020/11/23 02:14:59.280,Y
Forklift,Crating,John,Daily Task,2020/11/23 03:14:59.280,Y
Shop Floor,Stacking,Steve,Weekly,2020/11/23 04:14:59.280,N
Truck,Storage,Brain,Daily Task,2020/11/23 06:14:59.280,Y
Forklift,Crating,John,Daily Task,2020/11/23 08:14:59.280,Y
Forklift,Crating,John,Daily Task,2020/11/23 10:14:59.280,Y
Truck,Storage,Brain,Daily Task,2020/11/23 12:14:59.280,Y
""")

# Convert data to a list of dictionaries
reader = csv.DictReader(f)
data = [dict(line) for line in reader]
df = pandas.DataFrame(data)

# Parse time field
df["Time"] = pandas.to_datetime(df["Time"])

# Select only required tasks
df = df.loc[df["Required"] == "Y"]


def get_previous_time_per_group(group: pandas.DataFrame) -> pandas.DataFrame:
    """ Method to get the previous time as a column per group.

    Args:
        group (pandas.DataFrame): The input group (created by a groupby command)

    Returns:
        pandas.DataFrame: The resulting group with the previous time column
    """

    # Sort by the time column
    group = group.sort_values("Time")

    # Get the previous time by shifting the time column by 1 position
    group["Previous Time"] = group["Time"].shift(1)

    # Return the group, including the previous time
    return group


# Group the dataframe by task (or whatever you want) and per group get the previous time column
df = df.groupby("Task").apply(get_previous_time_per_group)

# Clean up the index created by the groupby
df = df.droplevel(0).sort_index()

Output


print(df)
        Job     Task Operator     Comment                    Time Required  \
0  Forklift  Crating     John  Daily Task 2020-11-23 00:14:59.280        Y   
2     Truck  Storage    Brain  Daily Task 2020-11-23 02:14:59.280        Y   
3  Forklift  Crating     John  Daily Task 2020-11-23 03:14:59.280        Y   
5     Truck  Storage    Brain  Daily Task 2020-11-23 06:14:59.280        Y   
6  Forklift  Crating     John  Daily Task 2020-11-23 08:14:59.280        Y   
7  Forklift  Crating     John  Daily Task 2020-11-23 10:14:59.280        Y   
8     Truck  Storage    Brain  Daily Task 2020-11-23 12:14:59.280        Y   

            Previous Time  
0                     NaT  
2                     NaT  
3 2020-11-23 00:14:59.280  
5 2020-11-23 02:14:59.280  
6 2020-11-23 03:14:59.280  
7 2020-11-23 08:14:59.280  
8 2020-11-23 06:14:59.280

NOTE: If there is no previous time in the data, the value will be NaT (Not a Time)

How to find the previous occurrence of a CSV row using Python

Answers (1)

Related Questions