vikingd

Reputation: 43

Nested for-loop optimization while iterating over DataFrames

I am fairly new to Python and coding. I am looking for a way to optimize a nested for loop. The nested for loop I have written works perfectly fine, but it takes a lot of time to run. I have explained the basic idea behind my original code and what I have tried below:

import pandas as pd

data = [['a', '35-44', 'male', ['b', 'z', 'x']], ['b', '15-24', 'female', ['a', 'z', 'q']], \
        ['r', '35-44', 'male', ['z', 'a', 'd']], ['q', '15-24', 'female', ['u', 'k', 'b']]]
df = pd.DataFrame(data, columns=['ID', 'age_group', 'gender', 'matching_ids'])

df is the DataFrame I am working on. What I want to do is compare each 'ID' in df with every other 'ID' in the same df and check whether it meets certain conditions:

  1. If the age_group is equal.
  2. If the gender is the same.
  3. If the 'ID' is in 'matching_ids'.

If these conditions are met, I need to append that row to a separate dataframe (sample_df). This is the code with the nested for loop that works fine:

df_copy = df.copy()
sample_df = pd.DataFrame()
for i in range(len(df)):
    for j in range(len(df)):
        # Keep row j when a different row i shares its age group and gender
        # and row i's ID appears in row j's matching_ids.
        if (i != j) and (df.iloc[i]['ID'] in df_copy.iloc[j]['matching_ids']) and \
           (df.iloc[i]['gender'] == df_copy.iloc[j]['gender']) and \
           (df.iloc[i]['age_group'] == df_copy.iloc[j]['age_group']):
            sample_df = sample_df.append(df_copy.iloc[[j]])

I tried simplifying it by writing a function and using df.apply(func), but it still takes almost the same amount of time. Below is the version written with a function:

sample_df_func = pd.DataFrame()

def func_extract(x):
    global sample_df_func
    for k in range(len(df)):
        if (x['ID'] != df_copy.iloc[k]['ID']) and (x['ID'] in df_copy.iloc[k]['matching_ids']) and \
           (x['gender'] == df_copy.iloc[k]['gender']) and \
           (x['age_group'] == df_copy.iloc[k]['age_group']):
            sample_df_func = sample_df_func.append(df_copy.iloc[[k]])

df.apply(func_extract, axis=1)
sample_df_func

I am looking for ways to simplify this and optimize it further. Forgive me if the solution is very simple and I am just not able to figure it out.

Thanks

PS: I only started coding two months ago.

Upvotes: 2

Views: 293

Answers (1)

Mustafa Aydın

Reputation: 18306

We can form groups over age_group and gender to obtain subsets where the first two conditions hold automatically. For the third condition, we can explode the matching_ids, check which of the exploded ids are isin the group's IDs, and keep the qualifying rows within each group with boolean indexing:

out = (df.groupby(["age_group", "gender"])
         .apply(lambda s: s[s.matching_ids.explode().isin(s.ID).groupby(level=0).any()])
         .reset_index(drop=True))

where lastly we reset the index to get rid of grouping variables as index,

to get

>>> out

  ID age_group  gender matching_ids
0  b     15-24  female    [a, z, q]
1  q     15-24  female    [u, k, b]
2  r     35-44    male    [z, a, d]
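If the groupby-apply ever becomes the bottleneck on a larger frame, the same idea can also be expressed as a flat explode + merge (a sketch, not part of the answer above; like the answer, it assumes no row lists its own ID in its matching_ids, which holds for the sample data):

```python
import pandas as pd

data = [['a', '35-44', 'male', ['b', 'z', 'x']],
        ['b', '15-24', 'female', ['a', 'z', 'q']],
        ['r', '35-44', 'male', ['z', 'a', 'd']],
        ['q', '15-24', 'female', ['u', 'k', 'b']]]
df = pd.DataFrame(data, columns=['ID', 'age_group', 'gender', 'matching_ids'])

# One row per (ID, candidate id) pair.
exploded = df.explode('matching_ids')

# A candidate only counts if it is a real ID with the same age_group and gender,
# so merge the exploded candidates against the original IDs on all three keys.
hits = exploded.merge(
    df[['ID', 'age_group', 'gender']].rename(columns={'ID': 'matching_ids'}),
    on=['matching_ids', 'age_group', 'gender'],
)

# Keep the original rows whose list produced at least one hit.
out = df[df['ID'].isin(hits['ID'])].reset_index(drop=True)
print(out['ID'].tolist())  # ['b', 'r', 'q']
```

This keeps the same rows as the groupby version (b, q, r), just in the original row order of df rather than group order.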

Upvotes: 3
