Reputation: 1679
I am new to optimization and need help improving the run time of this code. It accomplishes my task, but it takes forever. Any suggestions on improving it so it runs faster?
Here is the code:
def probabilistic_word_weighting(df, lookup):
    # instantiate new placeholder for the class weights of each text sequence in the df
    class_probabilities = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    for index, row in lookup.iterrows():
        if row.word in df.words.split():
            class_proba_ = row.class_proba.strip('][').split(', ')
            class_proba_ = [float(i) for i in class_proba_]
            class_probabilities = [a + b for a, b in zip(class_probabilities, class_proba_)]
    return class_probabilities
The two input df's look like this:
df
index    words
1 i havent been back
2 but its
3 they used to get more closer
4 no way
5 when we have some type of a thing for
6 and she had gone to the doctor
7 suze
8 the only time the parents can call is
9 i didnt want to go on a cruise
10 people come aint got
lookup
index word class_proba
6231 been [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
8965 havent [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
3270 derive [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
7817 a [0.0, 0.0, 7.451379, 6.552, 0.0, 0.0, 0.0, 0.0]
3452 hello [0.0, 0.0, 0.0, 0.0, 0.000155327, 0.0, 0.0, 0.0]
5112 they [0.0, 0.0, 0.00032289312, 0.0, 0.0, 0.0, 0.0, 0.0]
1012 time [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
7468 some [0.000193199, 0.0, 0.0, 0.000212947, 0.0, 0.0, 0.0, 0.0]
6428 people [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
5537 scuba [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.27899487]
What it's doing is essentially iterating through each row in lookup, which contains a word and its relative class weights. If the word is found in any text sequence in df.words, then the class_proba for that lookup.word gets added to the class_probabilities total for that sequence. So it's effectively looping through every row in df for every iteration over the lookup rows.
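For context, the function is applied to one row of df at a time; a minimal sketch, assuming a row-wise df.apply (the exact call isn't shown here and may differ):

# Presumed invocation (assumption: a row-wise df.apply).
# Every call rescans all of lookup, so the total work is roughly
# len(df) * len(lookup) iterations.
df['class_proba'] = df.apply(
    lambda row: probabilistic_word_weighting(row, lookup), axis=1)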
How can this be done faster?
Upvotes: 1
Views: 272
Reputation: 29635
IIUC, you are using df.apply with your function, but you can do it like this instead. The idea is not to redo the operation on the rows of lookup each time you find a matching word, but to do it once and reshape df so that the work can be done with vectorized operations.
1: Reshape the words column of df with str.split, stack and to_frame to get a new row for each word:
s_df = df['words'].str.split(expand=True).stack().to_frame(name='split_word')
print (s_df.head(8))
split_word
0 0 i
1 havent
2 been
3 back
1 0 but
1 its
2 0 they
1 used
2: Reshape lookup by set_index on the word column, then str.strip, str.split and astype to get a DataFrame with word as the index and each value of class_proba in its own column:
split_lookup = lookup.set_index('word')['class_proba'].str.strip('][')\
.str.split(', ', expand=True).astype(float)
print (split_lookup.head())
0 1 2 3 4 5 6 7
word
been 0.0 0.0 0.000000 0.000 0.000000 0.0 0.0 5.278995
havent 0.0 0.0 0.000000 0.000 0.000000 0.0 0.0 5.278995
derive 0.0 0.0 0.000000 0.000 0.000000 0.0 0.0 5.278995
a 0.0 0.0 7.451379 6.552 0.000000 0.0 0.0 0.000000
hello 0.0 0.0 0.000000 0.000 0.000155 0.0 0.0 0.000000
3: Merge both, drop the unnecessary column, then groupby on level=0 (the original index of df) and sum:
df_proba = s_df.merge(split_lookup, how='left',
left_on='split_word', right_index=True)\
.drop('split_word', axis=1)\
.groupby(level=0).sum()
print (df_proba.head())
0 1 2 3 4 5 6 7
0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 10.55799
1 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.00000
2 0.000000 0.0 0.000323 0.000000 0.0 0.0 0.0 0.00000
3 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.00000
4 0.000193 0.0 7.451379 6.552213 0.0 0.0 0.0 0.00000
4: Finally, convert to lists with to_numpy and tolist and reassign to the original df:
df['class_proba'] = df_proba.to_numpy().tolist()
print (df.head())
words \
0 i havent been back
1 but its
2 they used to get more closer
3 no way
4 when we have some type of a thing for
class_proba
0 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.55798974]
1 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
2 [0.0, 0.0, 0.00032289312, 0.0, 0.0, 0.0, 0.0, ...
3 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
4 [0.000193199, 0.0, 7.451379, 6.552212946999999...
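Putting the four steps together, here is a minimal sketch of the whole thing as a single function (the name vectorized_word_weighting is just for illustration; it assumes df has a words column of strings and lookup has word and class_proba columns as shown above):

def vectorized_word_weighting(df, lookup):
    # 1: one row per (original df index, word)
    s_df = df['words'].str.split(expand=True).stack().to_frame(name='split_word')
    # 2: word-indexed frame with one float column per class
    split_lookup = (lookup.set_index('word')['class_proba']
                          .str.strip('][')
                          .str.split(', ', expand=True)
                          .astype(float))
    # 3: attach the class weights to each word, then sum per original sequence
    df_proba = (s_df.merge(split_lookup, how='left',
                           left_on='split_word', right_index=True)
                    .drop('split_word', axis=1)
                    .groupby(level=0).sum())
    # 4: convert back to lists and attach to a copy of the original df
    out = df.copy()
    out['class_proba'] = df_proba.to_numpy().tolist()
    return out

Words that never appear in lookup only contribute NaN to the merge, which the groupby sum treats as 0, so sequences with no matches end up with an all-zero list, matching the original loop's behaviour.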
Upvotes: 3