Reputation: 1035
I am trying to implement the Single Pass algorithm for the following problem. Please see my code below:
def sp_algorithm(self, dataframe, col_dict):
    # for each index value, get all columns that contain it
    for ind in dataframe.index:
        # vectorised membership test across all columns
        cols_list = dataframe.columns[dataframe.isin([ind]).any()]

        # for each key in col_dict, intersect its values with cols_list;
        # if the current cell is null or empty, do nothing
        for key_col in col_dict.keys():
            column_val = dataframe.loc[ind, key_col]
            # NaN != NaN, so this comparison skips missing values
            if (column_val == column_val) and (column_val != ''):
                col_dict[key_col] = list(set(col_dict[key_col]).intersection(cols_list))
The dictionary containing column names looks like this:
col_dict = {'col_A': ['col_B', 'col_C'], 'col_B': ['col_A', 'col_C'], 'col_C': ['col_A', 'col_B']}
As you can see, my code is currently of O(n^2) time complexity, as there are two nested for loops in there.
Currently, each iteration (covering both loops) takes around 0.8 seconds, which is probably not a problem for a small dataset. However, the dataset I am processing has 300k rows and over 80 columns.
My problem is: how do I implement single pass for the dictionary intersection step, so that there is only one for loop instead of two?
EDIT The dataframe will contain a sorted index and values in ascending order, as below:

       col_A  col_B  col_C
index
0        nan      0      0
1          1    nan      1
2          2    nan      2
3        nan      3      3
So, my current function loops through each index ind, gets the column names via cols_list = dataframe.columns[dataframe.isin([ind]).any()], and intersects these with the dictionary values.
1st iteration:
cols_list = ['col_B', 'col_C']
Then it looks up the values for each key in col_dict (only if the column has a value, so nan cells are skipped), intersects them with cols_list, and updates the dictionary:
col_dict = {'col_A': ['col_B', 'col_C'], 'col_B': ['col_C'], 'col_C': ['col_B']}
2nd iteration:
col_B is skipped when checking the dictionary value, as its cell is nan, so its dictionary value stays the same.
cols_list = ['col_A', 'col_C']
col_dict = {'col_A': ['col_C'], 'col_B': ['col_C'], 'col_C': []}
3rd iteration:
col_B is skipped when checking the dictionary value, as its cell is nan, so its dictionary value stays the same.
cols_list = ['col_A', 'col_C']
col_dict = {'col_A': ['col_C'], 'col_B': ['col_C'], 'col_C': []}
4th iteration:
col_A is skipped when checking the dictionary value, as its cell is nan, so its dictionary value stays the same.
cols_list = ['col_B', 'col_C']
col_dict = {'col_A': ['col_C'], 'col_B': ['col_C'], 'col_C': []}
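Putting the walkthrough above into runnable form (a sketch using the sample frame from the EDIT; pd.notna stands in for the NaN self-comparison in my function):

```python
import pandas as pd

# sample frame from the EDIT above
df = pd.DataFrame(
    {"col_A": [float("nan"), 1, 2, float("nan")],
     "col_B": [0, float("nan"), float("nan"), 3],
     "col_C": [0, 1, 2, 3]},
    index=[0, 1, 2, 3],
)
col_dict = {"col_A": ["col_B", "col_C"],
            "col_B": ["col_A", "col_C"],
            "col_C": ["col_A", "col_B"]}

for ind in df.index:
    # columns whose values contain the current index value
    cols_list = df.columns[df.isin([ind]).any()]
    for key_col in col_dict:
        # skip keys whose cell at this row is NaN
        if pd.notna(df.loc[ind, key_col]):
            col_dict[key_col] = list(set(col_dict[key_col]).intersection(cols_list))

# after the 4th iteration:
# col_dict == {'col_A': ['col_C'], 'col_B': ['col_C'], 'col_C': []}
```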
Upvotes: 2
Views: 112
Reputation: 15310
The strength of numpy/scipy/pandas lies in using C code to do most of the work for you. Given that, in your words, there are 300k rows and 80 columns, I suggest that you first make every effort to ensure that you are processing the rows in C, not Python. Thus, your first loop should be eliminated: don't process 300k elements using Python.
The way I read your requirements, you have an index (row labels) that has values of some type that may appear in the individual cells in your other columns. Something like this:
Index    A    B    C    D
1        0    0    3    0
2        2    0    1   -1
3        0    0    0    0
192      0    0    1   -1
You want to know, for each Index, if that value appears in any of column A, column B, column C, etc. If any Index value appears in a column, that column is "ALIVE".
At the end of the process, columns are either ALIVE or not, and you then want to filter your other dictionary to exclude columns that are not ALIVE.
In my example above, column A is considered ALIVE because of { 2 }, and column C is considered ALIVE because of { 3, 1 }, but columns B and D are not ALIVE because they don't contain any values that are present in Index. Is this right?
Try using isin to determine if the values in the columns are present in the index. Then use any to collapse the boolean results to a single boolean value that determines if the column is alive:
row_labels = df.index
col_is_alive = df.isin(row_labels).any()  # NB: axis=0 is the default
(NOTE: I am not in a place where I can run this code. It may contain syntax or other errors.)
Now you have a series of 80 boolean values, telling you which columns are alive. You can do your processing however you like.
alive_col_names = { name for name in df.columns if col_is_alive[name] } # Set comprehension
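For instance, on the small frame from my table above (a sketch; not run against your real data):

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [0, 2, 0, 0],
     "B": [0, 0, 0, 0],
     "C": [3, 1, 0, 1],
     "D": [0, -1, 0, -1]},
    index=[1, 2, 3, 192],
)
col_is_alive = df.isin(df.index).any()  # one boolean per column
alive_col_names = {name for name in df.columns if col_is_alive[name]}
# alive_col_names == {'A', 'C'}: A via {2}, C via {3, 1}
```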
However, your initial problem statement makes it sound like you are doing this one time (as opposed to iteratively updating the groups of column names). In that case, rather than intersect the dictionary values (lists of every column name but the key) I would suggest simply computing the values directly. That is, compute the key->value pairs directly rather than trying to "intersect" the value lists with that list of all column names.
col_dict = { key:alive_col_names - {key} for key in alive_col_names}
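Assuming the alive columns came out as {'A', 'C'}, as in my example above, that comprehension yields:

```python
alive_col_names = {"A", "C"}  # assumed result of the isin/any step
# each key maps to every other alive column
col_dict = {key: alive_col_names - {key} for key in alive_col_names}
# col_dict == {'A': {'C'}, 'C': {'A'}}
```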
On the other hand, if you are somehow iteratively updating these values, then I would suggest you make your second data structure a dictionary of string -> set instead of string -> list, since that will give you access to the standard set operations and behavior.
new_col_dict = {}
for key, other_cols in col_dict.items():
    if key not in alive_col_names:
        continue
    # keep only the columns that are still alive
    new_col_dict[key] = other_cols.intersection(alive_col_names)
col_dict = new_col_dict
(Which can be collapsed using a dict comprehension but perhaps should not be, in the interest of readability.)
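For reference, the collapsed form would be a single comprehension; this sketch assumes the string -> set shape suggested above, with hypothetical values:

```python
alive_col_names = {"A", "C"}  # hypothetical alive set
col_dict = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"}}

# drop dead keys and intersect each value set with the alive columns
col_dict = {key: other_cols & alive_col_names
            for key, other_cols in col_dict.items()
            if key in alive_col_names}
# col_dict == {'A': {'C'}, 'C': {'A'}}
```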
Upvotes: 1