vikingd

Reputation: 43

Nested for-loop optimization while iterating over DataFrames

I am fairly new to Python and coding. I am looking for a way to optimize a nested for loop. The nested for loop I have written works perfectly fine, but it takes a lot of time to run. I have explained the basic idea behind my original code and what I have tried below:

import pandas as pd

data = [['a', '35-44', 'male', ['b', 'z', 'x']], ['b', '15-24', 'female', ['a', 'z', 'q']], \
        ['r', '35-44', 'male', ['z', 'a', 'd']], ['q', '15-24', 'female', ['u', 'k', 'b']]]
df = pd.DataFrame(data, columns=['ID', 'age_group', 'gender', 'matching_ids'])

df is the DataFrame I am working on. What I want to do is compare each 'ID' in df with every other 'ID' in the same df and check whether it meets certain conditions:

  1. If the age_group is equal.
  2. If the gender is the same.
  3. If the 'ID' is in 'matching_ids'.

If these conditions are met, I need to append that row to a separate dataframe (sample_df). This is the code with the nested for loop that works fine:

df_copy = df.copy()
sample_df = pd.DataFrame()
for i in range(len(df)):
    for j in range(len(df)):
        # Keep row j when a different row i shares its age group and gender
        # and row i's ID appears in row j's matching_ids.
        if (i != j) and (df.iloc[i]['ID'] in df_copy.iloc[j]['matching_ids']) and \
           (df.iloc[i]['gender'] == df_copy.iloc[j]['gender']) and \
           (df.iloc[i]['age_group'] == df_copy.iloc[j]['age_group']):
            sample_df = sample_df.append(df_copy.iloc[[j]])

I tried simplifying it by writing a function and using df.apply(func), but it still takes almost the same amount of time. Below is the version written with a function:

sample_df_func = pd.DataFrame()

def func_extract(x):
    global sample_df_func
    for k in range(len(df)):
        if (x['ID'] != df_copy.iloc[k]['ID']) and (x['ID'] in df_copy.iloc[k]['matching_ids']) and \
           (x['gender'] == df_copy.iloc[k]['gender']) and \
           (x['age_group'] == df_copy.iloc[k]['age_group']):
            sample_df_func = sample_df_func.append(df_copy.iloc[[k]])

df.apply(func_extract, axis=1)
sample_df_func

I am looking for ways to simplify this and optimize it further. Forgive me if the solution is very simple and I am just not able to figure it out.

Thanks

PS: I only started coding two months ago.

Upvotes: 2

Views: 293

Answers (1)

Mustafa Aydın

Reputation: 18306

We can form groups over age_group and gender to obtain subsets where the first two conditions hold automatically. For the third condition, we can explode the matching_ids, check which of the exploded ids are isin the group's IDs, and keep the qualifying rows within each group with boolean indexing:

out = (df.groupby(["age_group", "gender"])
         .apply(lambda s: s[s.matching_ids.explode().isin(s.ID).groupby(level=0).any()])
         .reset_index(drop=True))

where lastly we reset the index to get rid of grouping variables as index,

to get

>>> out

  ID age_group  gender matching_ids
0  b     15-24  female    [a, z, q]
1  q     15-24  female    [u, k, b]
2  r     35-44    male    [z, a, d]
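If the groupby-apply ever becomes the bottleneck on a larger frame, the same idea can also be expressed as a flat explode + merge (a sketch, not part of the answer above; like the answer, it assumes no row lists its own ID in its matching_ids, which holds for the sample data):

```python
import pandas as pd

data = [['a', '35-44', 'male', ['b', 'z', 'x']],
        ['b', '15-24', 'female', ['a', 'z', 'q']],
        ['r', '35-44', 'male', ['z', 'a', 'd']],
        ['q', '15-24', 'female', ['u', 'k', 'b']]]
df = pd.DataFrame(data, columns=['ID', 'age_group', 'gender', 'matching_ids'])

# One row per (ID, candidate id) pair.
exploded = df.explode('matching_ids')

# A candidate only counts if it is a real ID with the same age_group and gender,
# so merge the exploded candidates against the original IDs on all three keys.
hits = exploded.merge(
    df[['ID', 'age_group', 'gender']].rename(columns={'ID': 'matching_ids'}),
    on=['matching_ids', 'age_group', 'gender'],
)

# Keep the original rows whose list produced at least one hit.
out = df[df['ID'].isin(hits['ID'])].reset_index(drop=True)
print(out['ID'].tolist())  # ['b', 'r', 'q']
```

This keeps the same rows as the groupby version (b, q, r), just in the original row order of df rather than group order.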

Upvotes: 3
