Reputation: 21
Hope you all are having an excellent week.
So, I was finishing a script that worked really well for a specific use case. The base is as follows:
Function cosine_similarity_join:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# fit_vectorizer is a helper defined elsewhere in the script (not shown here)
def cosine_similarity_join(a: pd.DataFrame, b: pd.DataFrame, col_name):
    a_len = len(a[col_name])
    # all of the "documents" in a 1D array
    corpus = np.concatenate([a[col_name].to_numpy(), b[col_name].to_numpy()])
    # vectorize the array
    tfidf, vectorizer = fit_vectorizer(corpus, 3)
    # in this matrix each row is a str from a and each col is a str from b; the value is the cosine similarity
    res = cosine_similarity(tfidf[:a_len], tfidf[a_len:])
    res_series = pd.DataFrame(res).stack().rename("score")
    res_series.index.set_names(['a', 'b'], inplace=True)
    # join scores to b
    b_scored = pd.merge(left=b, right=res_series, left_index=True, right_on='b').droplevel('b')
    # find the indices on which to match (highest score in each row)
    best_match = np.argmax(res, axis=1)
    # Join the rest of
    res = pd.merge(left=a, right=b_scored, left_index=True, right_index=True, suffixes=('', '_Guess'))
    print(res)
    df = res.reset_index()
    df = df.iloc[df.groupby(by="RefCol")["score"].idxmax()].reset_index(drop=True)
    return df
This works like a charm when I do something like:
resulting_df = cosine_similarity_join(df1,df2,'My_col')
But in my case, I need something along the lines of:
big_df = pd.read_csv('some_really_big_df.csv')
some_other_df = pd.read_csv('some_other_small_df.csv')
counter = 0
size = 10000
total_size = len(big_df)
while counter <= total_size:
    small_df = big_df[counter:counter+size]
    resulting_df = cosine_similarity_join(small_df,some_other_df,'My_col')
    counter += size
I have already traced the problem to one specific line in the function:
res = pd.merge(left=a, right=b_scored, left_index=True, right_index=True, suffixes=('', '_Guess'))
Basically this res dataframe is coming out empty and I just cannot understand why (since when I replicate the values outside of the loop it works just fine)...
I have been looking at this for hours now and would gladly welcome a fresh perspective on it.
Thank you all in advance!
Upvotes: 0
Views: 34
Reputation: 21
Found the problem!
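I just needed to reset the index before the join! A slice of the big df keeps the row labels of the big one (10000 to 19999 for the second chunk, and so on), while the scores built inside the function are labelled from 0. The index-on-index merge therefore finds no matching labels and comes out empty for every chunk after the first.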
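To make the mismatch concrete, here is a minimal sketch with made-up data (the frames `big`, `chunk` and `scores` are hypothetical stand-ins, not taken from the original script). It only illustrates how a slice keeps its parent's row labels and what that does to an index-on-index merge:

```python
import pandas as pd

# Hypothetical stand-in for big_df: 6 rows with the default 0..5 index.
big = pd.DataFrame({"My_col": list("abcdef")})

# Second chunk of size 3: the row labels are still 3, 4, 5.
chunk = big[3:6]

# Hypothetical stand-in for b_scored, whose index is positional (0, 1, 2).
scores = pd.DataFrame({"score": [0.9, 0.8, 0.7]})

# Index-on-index inner merge: labels {3, 4, 5} and {0, 1, 2} share nothing,
# so the result has no rows.
empty = pd.merge(chunk, scores, left_index=True, right_index=True)

# Once the chunk is re-labelled from 0, the labels line up again.
fixed = pd.merge(chunk.reset_index(drop=True), scores,
                 left_index=True, right_index=True)
```

Inspecting `chunk.index` is usually the quickest way to spot this kind of problem.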
So basically all I needed to do was:
while counter <= total_size:
    small_df = big_df[counter:counter+size]
    small_df = small_df.reset_index()
    resulting_df = cosine_similarity_join(small_df,some_other_df,'My_col')
    counter += size
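A possible variant of the loop, offered only as a sketch: stepping with `range` avoids the empty final slice that `counter <= total_size` produces when the row count is an exact multiple of `size`, and `drop=True` discards the old labels instead of adding them as an extra `index` column. The helper name `iter_chunks` and the `demo` frame are made up for illustration.

```python
import pandas as pd

def iter_chunks(df, size):
    """Yield consecutive slices of df, each re-labelled from 0."""
    for start in range(0, len(df), size):
        # iloc slices by position; reset_index(drop=True) gives labels 0..n-1
        yield df.iloc[start:start + size].reset_index(drop=True)

# Hypothetical small frame standing in for big_df.
demo = pd.DataFrame({"My_col": list("abcdefg")})
chunks = list(iter_chunks(demo, 3))
```

With this, the loop would become `for small_df in iter_chunks(big_df, size):` followed by the same `cosine_similarity_join` call. Whether to drop the old labels depends on whether the original row positions are needed later on.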
I'll leave it here in case it helps someone :)
Cheers!
Upvotes: 1