Reputation: 43
I am fairly new to python and coding. I am looking for a way to optimize a nested for loop. The nested for loop I have written works perfectly fine, but it takes a lot of time to run. I have explained the basic idea behind my original code and what I have tried to do, below:
data = [['a', '35-44', 'male', ['b', 'z', 'x']], ['b', '15-24', 'female', ['a', 'z', 'q']], \
['r', '35-44', 'male', ['z', 'a', 'd']], ['q', '15-24', 'female', ['u', 'k', 'b']]]
df = pd.DataFrame(data, columns= ['ID', 'age_group', 'gender', 'matching_ids'])
df is the Dataframe that I am working on. What I want to do is compare each 'ID' in df with every other 'ID' in the same df and check if it follows certain conditions.
If these conditions are met I need to append that row to a separate dataframe (sample_df) This is the code with the nested for loop that works fine:
df_copy = df.copy()
sample_df = pd.DataFrame()
for i in range(len(df)):
for j in range(len(df)):
if (i!=j) and (df.iloc[i]['ID'] in df_copy.iloc[j]['matching_ids']) and \
(df.iloc[i]['gender'] == df_copy.iloc[j]['gender']) and\
(df.iloc[i]['age_group'] == df_copy.iloc[j]['age_group']):
sample_df = sample_df.append(df_copy.iloc[[j]])
I tried simplifying it by writing a function and using df.apply(func), but it still takes almost the same amount of time. Below is the code written with using a function:
sample_df_func = pd.DataFrame()
def func_extract(x):
for k in range(len(df)):
if (x['ID'] != df_copy.iloc[k]['ID']) and (x['ID'] in df_copy.iloc[k]['matching_ids']) and \
(x['gender'] == df_copy.iloc[k]['gender']) and\
(x['age_group'] == df_copy.iloc[k]['age_group']):
global sample_df_func
sample_df_func = sample_df_func.append(df_copy.iloc[[k]])
df.apply(func_extract, axis = 1)
sample_df_func
I am looking for ways to simplify this and optimize it further. Forgive me, if the solution to this is very simple and I am not able to figure it out.
Thanks
PS: I've just started coding 2 months back.
Upvotes: 2
Views: 293
Reputation: 18306
We can form groups over age_group
and gender
to obtain subsets where first two conditions hold automatically. For the third condition, we can explode
the matching_ids
and then check if any
of the ids isin
the ID
and keep those rows within groups only with boolean indexing:
out = (df.groupby(["age_group", "gender"])
.apply(lambda s: s[s.matching_ids.explode().isin(s.ID).groupby(level=0).any()])
.reset_index(drop=True))
where lastly we reset the index to get rid of grouping variables as index,
to get
>>> out
ID age_group gender matching_ids
0 b 15-24 female [a, z, q]
1 q 15-24 female [u, k, b]
2 r 35-44 male [z, a, d]
Upvotes: 3