Want to optimize my code for finding out overlapping times in a big amount of records pandas

Question

I have a data table consisting 100000 records with 50 columns, It has a start time and end time value and a equipment key for which records are available. When this nodes are down, their records are stored. so start time is when the node goes down, and end time is when the node is up after getting down. If there are multiple records where we have same equipment key, and start time and end time values which are inside of previous record's start time and end time, then we call it that this new record has overlapping time and we need to ignore them. To find out these overlapping records, I have written a function and apply it on a dataframe, but it's taking a long time. I am not that efficient in optimization, that's why seeking any suggestion regarding this.

sitecode_info = []

def check_overlapping_sitecode(it):
    sitecode = it['equipmentkey']
    fo = it['firstoccurrence']
    ct = it['cleartimestamp']

    if len(sitecode_info) == 0:
        sitecode_info.append({
            'sc': sitecode,
            'fo': fo,
            'ct': ct
        })
        return 0
    else:
        for list_item in sitecode_info:
            for item in list_item.keys():
                if item == 'sc':
                    if list_item[item] == sitecode:
                        # print("matched")
                        if fo >= list_item['fo'] and ct <= list_item['ct'] or \
                                fo >= list_item['fo'] and fo <= list_item['ct'] and ct >= list_item['ct'] or \
                                fo <= list_item['fo'] and ct >= list_item['ct'] or \
                                fo <= list_item['fo'] and ct >= list_item['fo'] and ct <= list_item['ct']:
                            return 1
                        else:
                            sitecode_info.append({
                              'sc': sitecode,
                              'fo': fo,
                              'ct': ct
                            })
                            return 0
                    else:
                        sitecode_info.append({
                            'sc': sitecode,
                            'fo': fo,
                            'ct': ct
                        })
                        return 0

I am calling this as following.

temp_df['false_alarms'] = temp_df.apply(check_overlapping_sitecode, axis=1)

Tanner Martin · Accepted Answer

I think you were just iterating over that list of dictionaries a touch too much.

**EDIT:**Added appending fo's and ct's even if it returns 1 in the method for enhanced accuracy.

'''
setting an empty dictionary. 
this will look like: {sc1: [[fo, ct], [fo, ct]],
                        sc2:[[fo, ct], [fo, ct]]}
the keys are just the site_code, 
this way we don't have to iterate over all of the fo's and ct's, just the ones related to that site code.

'''
sitecode_info = {}

# i set up a dataframe with 200000 rows x 50 columns

def check_overlapping_sitecode(site_code, fo, ct):
    try:
        #try to grab the existing site_code information from sitecode_info dict.
        #if it fails then go ahead and make it while also returning 0 for that site_code
        my_list = sitecode_info[site_code]
        #if it works, go through that site's list. 
        for fo_old, ct_old in my_list:
            #if the first occurence is >= old_first occurenc and <= cleartimestamp
            if fo >= fo_old and fo <= ct_old:
                sitecode_info[site_code].append([fo, ct])
                return 1
            #same but for cleartimestamp instead
            elif ct <= ct_old and ct >= fo_old:
                sitecode_info[site_code].append([fo, ct])
                return 1
            else:
                #if it doesnt overlap at all go ahead and set the key to a list in list
                sitecode_info[site_code].append([fo, ct])
                return 0
    except:
        #set the key to a list in list if it fails
        sitecode_info[site_code] = [[fo, ct]]
        return 0

t = time.time()
"""Here's the real meat and potatoes.
using a lambda function to call method "check_overlapping_sitecode". 
lambda: x where x is row
return the output of check_overlapping_sitecode
"""
temp_df['false_alarms'] = temp_df.apply(lambda x: check_overlapping_sitecode(x['equipmentkey'], x['firstoccurrence'], x['cleartimestamp']), axis=1)
print(time.time()-t)
#this code runs nearly 6 seconds for me. 
#then you can do whatever you want with your DF.

Want to optimize my code for finding out overlapping times in a big amount of records pandas

Answers (1)

Related Questions