BlackHat

Reputation: 755

Python Pandas Double for Loop Efficiency

I'm trying to apply a double for loop to solve a problem. Ideally I'd prefer not to use a for loop at all, since the dataset I have is huge and it will take ages to run through the loop. Below is the code:

import pandas
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import time

words_data_set = pandas.DataFrame(
    {'keywords': ['wlmart womens book set',
                  'microsoft fish sauce',
                  'books from walmat store',
                  'mens login for facebook fools',
                  'mens login for facbook fools',
                  'login for twetter boy',
                  'apples from cook']})

company_name_list = ['walmart', 'microsoft', 'facebook', 'twitter', 'amazon', 'apple']

print(len(words_data_set), '....rows')
start_time = time.time()


fuzzed_data_final = pandas.DataFrame()
for s in words_data_set.keywords.tolist():

    step1 = words_data_set[words_data_set.keywords == s]
    step1['keywords2'] = step1.keywords.str.split()
    step2 = step1.keywords2.values.tolist()
    step3 = [item for sublist in step2 for item in sublist]
    step3 = pandas.DataFrame(step3)
    step3.columns = ['search_words']
    step3['keywords'] = s

    fuzzed_data = pandas.DataFrame()
    for w in step3.search_words.tolist():
        step4 = step3[step3.search_words == w]
        step5 = pandas.DataFrame(process.extract(w, company_name_list))
        step5.columns = ['w', 'score']
        if step5.score.max() >= 90:
            w = ''
        else:
            w

        step4['search_words'] = w
        fuzzed_data = fuzzed_data.append(step4)
    fuzzed_data_final = fuzzed_data_final.append(fuzzed_data)

print("--- %s seconds ---" % (time.time() - start_time))

How can I optimize this for speed and efficiency? In reality, words_data_set is about 1 million rows and company_name_list is about 2,000 elements.

Upvotes: 0

Views: 183

Answers (1)

Alex Lopatin

Reputation: 692

Try not to create new temporary objects with pandas when you can just use Python built-in functions. I don't know the exact problem you are trying to solve, but if I just clean up what looks to me like redundancy, the code runs 9 times faster (0.045 vs 0.410 sec):

import pandas
from fuzzywuzzy import process
from operator import itemgetter
import time

words_data_set = pandas.DataFrame({
    'keywords': ['wlmart womens book set',
                 'microsoft fish sauce',
                 'books from walmat store',
                 'mens login for facebook fools',
                 'mens login for facbook fools',
                 'login for twetter boy',
                 'apples from cook']})
company_name_list = [
    'walmart', 'microsoft', 'facebook', 'twitter', 'amazon', 'apple']
print(len(words_data_set), '....rows')
start_time = time.time()
fuzzed_data_final = pandas.DataFrame()
for s in words_data_set.keywords.tolist():
    step3 = pandas.DataFrame(s.split())
    step3.columns = ['search_words']
    step3['keywords'] = s

    fuzzed_data = pandas.DataFrame()
    for w in step3.search_words.tolist():
        step4 = step3[step3.search_words == w]
        # blank the word if its best fuzzy match against a company name scores >= 90
        if max(process.extract(w, company_name_list), key=itemgetter(1))[1] >= 90:
            w = ''
        # silence the SettingWithCopyWarning just for this assignment
        default = pandas.options.mode.chained_assignment
        pandas.options.mode.chained_assignment = None
        step4['search_words'] = w
        pandas.options.mode.chained_assignment = default
        fuzzed_data = fuzzed_data.append(step4)
    fuzzed_data_final = fuzzed_data_final.append(fuzzed_data)

print("--- %s seconds ---" % (time.time() - start_time))
print(fuzzed_data_final)

Output now:

7 ....rows
--- 0.04493832588195801 seconds ---
  search_words                       keywords
0                      wlmart womens book set
1       womens         wlmart womens book set
2                      wlmart womens book set
3          set         wlmart womens book set
0                        microsoft fish sauce
1         fish           microsoft fish sauce
2        sauce           microsoft fish sauce
0        books        books from walmat store
1         from        books from walmat store
2                     books from walmat store
3        store        books from walmat store
0         mens  mens login for facebook fools
1        login  mens login for facebook fools
2          for  mens login for facebook fools
3               mens login for facebook fools
4        fools  mens login for facebook fools
0         mens   mens login for facbook fools
1        login   mens login for facbook fools
2          for   mens login for facbook fools
3                mens login for facbook fools
4        fools   mens login for facbook fools
0        login          login for twetter boy
1          for          login for twetter boy
2      twetter          login for twetter boy
3          boy          login for twetter boy
0                            apples from cook
1         from               apples from cook
2         cook               apples from cook

Process finished with exit code 0

Output before:

7 ....rows
/Users/alex/PycharmProjects/game/pandas_double_for_loop_original.py:18: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  step1['keywords2'] = step1.keywords.str.split()
/Users/alex/PycharmProjects/game/pandas_double_for_loop_original.py:36: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  step4['search_words'] = w
--- 0.4108889102935791 seconds ---
  search_words                       keywords
0                      wlmart womens book set
1       womens         wlmart womens book set
2                      wlmart womens book set
3          set         wlmart womens book set
0                        microsoft fish sauce
1         fish           microsoft fish sauce
2        sauce           microsoft fish sauce
0        books        books from walmat store
1         from        books from walmat store
2                     books from walmat store
3        store        books from walmat store
0         mens  mens login for facebook fools
1        login  mens login for facebook fools
2          for  mens login for facebook fools
3               mens login for facebook fools
4        fools  mens login for facebook fools
0         mens   mens login for facbook fools
1        login   mens login for facbook fools
2          for   mens login for facbook fools
3                mens login for facbook fools
4        fools   mens login for facbook fools
0        login          login for twetter boy
1          for          login for twetter boy
2      twetter          login for twetter boy
3          boy          login for twetter boy
0                            apples from cook
1         from               apples from cook
2         cook               apples from cook

Process finished with exit code 0
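Note: instead of toggling pandas.options.mode.chained_assignment around the assignment, the SettingWithCopyWarning can be avoided at its source. A minimal sketch (my suggestion, not part of the code above): take an explicit copy of the slice, or write through .loc, as the warning text itself recommends:

import pandas

df = pandas.DataFrame({'search_words': ['wlmart', 'womens'], 'keywords': 'x'})

# Option 1: make the slice an explicit copy, then assign freely
step4 = df[df.search_words == 'wlmart'].copy()
step4['search_words'] = ''

# Option 2: write through .loc on the original frame
df.loc[df.search_words == 'wlmart', 'search_words'] = ''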

UPDATE: addressing the double loop efficiency itself. Here is the version 2 program:

import pandas
from fuzzywuzzy import process
import time

lines = [
    'wlmart womens book set', 'microsoft fish sauce',
    'books from walmat store', 'mens login for facebook fools',
    'mens login for facbook fools', 'login for twetter boy',
    'apples from cook'
]
companies = ['walmart', 'microsoft', 'facebook', 'twitter', 'amazon', 'apple']
fuzzed_data_final = pandas.DataFrame()
lines_results = []


def part0():
    counter = 0
    for line in lines:
        for word in line.split():
            counter += 1
    print('Part 0. Count all words.\n', counter, 'words')


def part1():
    for line in lines:
        line_results = []
        for word in line.split():
            match_score_list = process.extractBests(
                word, companies, score_cutoff=90, limit=1)
            line_results.append(True if match_score_list else False)
        lines_results.append(line_results)
    print('Part 1. Match all words.\n', lines_results)


def part2():
    global fuzzed_data_final
    for i, line in enumerate(lines):
        step3 = pandas.DataFrame(line.split())
        step3.columns = ['search_words']
        step3['keywords'] = line

        fuzzed_data = pandas.DataFrame()
        for j, word in enumerate(line.split()):
            step4 = step3[step3.search_words == word]
            w = word
            if lines_results[i][j]:
                w = ''
            default = pandas.options.mode.chained_assignment
            pandas.options.mode.chained_assignment = None
            step4['search_words'] = w
            pandas.options.mode.chained_assignment = default
            fuzzed_data = fuzzed_data.append(step4)
        fuzzed_data_final = fuzzed_data_final.append(fuzzed_data)
    print('Part 2. Create pandas.DataFrame fuzzed_data_final.\n',
          fuzzed_data_final)


def execute(f):
    start_time = time.perf_counter()
    f()
    total_time = time.perf_counter() - start_time
    print("--- %f seconds ---" % total_time)
    rows = 1      # millions of rows to extrapolate to
    names = 2000  # company names to extrapolate to
    # linear extrapolation from the 7-line, 6-name sample
    e = total_time / len(lines) / len(companies) * rows * 1000000. * names
    h = e / 3600
    d = h / 24
    print('Time estimation for %d million rows and %d company names: %d seconds or'
          ' %d hours or %d days'
          % (rows, names, e, h, d))


execute(part0)
execute(part1)
execute(part2)

The output:

Part 0. Count all words.
 28 words
--- 0.000032 seconds ---
Time estimation for 1 million rows and 2000 company names: 1534 seconds or 0 hours or 0 days
Part 1. Match all words.
 [[True, False, True, False], [True, False, False], [False, False, True, False], [False, False, False, True, False], [False, False, False, True, False], [False, False, False, False], [True, False, False]]
--- 0.006723 seconds ---
Time estimation for 1 million rows and 2000 company names: 320165 seconds or 88 hours or 3 days
Part 2. Create pandas.DataFrame fuzzed_data_final.
   search_words                       keywords
0                      wlmart womens book set
1       womens         wlmart womens book set
2                      wlmart womens book set
3          set         wlmart womens book set
0                        microsoft fish sauce
1         fish           microsoft fish sauce
2        sauce           microsoft fish sauce
0        books        books from walmat store
1         from        books from walmat store
2                     books from walmat store
3        store        books from walmat store
0         mens  mens login for facebook fools
1        login  mens login for facebook fools
2          for  mens login for facebook fools
3               mens login for facebook fools
4        fools  mens login for facebook fools
0         mens   mens login for facbook fools
1        login   mens login for facbook fools
2          for   mens login for facbook fools
3                mens login for facbook fools
4        fools   mens login for facbook fools
0        login          login for twetter boy
1          for          login for twetter boy
2      twetter          login for twetter boy
3          boy          login for twetter boy
0                            apples from cook
1         from               apples from cook
2         cook               apples from cook
--- 0.042164 seconds ---
Time estimation for 1 million rows and 2000 company names: 2007804 seconds or 557 hours or 23 days

Process finished with exit code 0

So, just reading 1 million lines and counting all the words would take about half an hour, fuzzy matching all the words about 88 hours, and creating fuzzed_data_final (roughly 4,000,000 rows) about 23 days. I will look at whether this can be optimized.
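For reference, the estimates above are a straight linear extrapolation of the measured times from 7 lines and 6 company names up to 1 million lines and 2,000 names; this sketch reproduces the arithmetic from execute above:

# linear scaling: measured_time * (target_lines / sample_lines) * (target_names / sample_names)
for label, t in [('part0', 0.000032), ('part1', 0.006723), ('part2', 0.042164)]:
    e = t / 7 / 6 * 1000000. * 2000
    print('%s: %d seconds = %d hours = %d days' % (label, e, e / 3600, e / 86400))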

UPDATE #2: with optimized creation of fuzzed_data_final

import pandas
from fuzzywuzzy import process
import time

lines = [
    'wlmart womens book set', 'microsoft fish sauce',
    'books from walmat store', 'mens login for facebook fools',
    'mens login for facbook fools', 'login for twetter boy',
    'apples from cook'
]
companies = ['walmart', 'microsoft', 'facebook', 'twitter', 'amazon', 'apple']

start_time = time.perf_counter()

keywords = []
search_words = []
for line in lines:
    for word in line.split():
        match_score_list = process.extractBests(
            word, companies, score_cutoff=90, limit=1)
        keywords.append(line)
        search_words.append('' if match_score_list else word)
fuzzed_data_final = pandas.DataFrame(
    {'search_words': pandas.Series(search_words),
     'keywords': pandas.Series(keywords)})

total_time = time.perf_counter() - start_time
print("--- %f seconds ---" % total_time)
rows = 1
names = 2000
e = total_time / len(lines) / len(companies) * rows * 1000000. * names
h = e / 3600
d = h / 24
print('Time estimation for %d million rows and %d company names: %d seconds or'
      ' %d hours or %d days' % (rows, names, e, h, d))
print(fuzzed_data_final)

The output:

/usr/local/bin/python3.7 /Users/alex/PycharmProjects/game/pandas_doble_for_loop_v3.py
--- 0.008402 seconds ---
Time estimation for 1 million rows and 2000 company names: 400107 seconds or 111 hours or 4 days
   search_words                       keywords
0                       wlmart womens book set
1        womens         wlmart womens book set
2                       wlmart womens book set
3           set         wlmart womens book set
4                         microsoft fish sauce
5          fish           microsoft fish sauce
6         sauce           microsoft fish sauce
7         books        books from walmat store
8          from        books from walmat store
9                      books from walmat store
10        store        books from walmat store
11         mens  mens login for facebook fools
12        login  mens login for facebook fools
13          for  mens login for facebook fools
14               mens login for facebook fools
15        fools  mens login for facebook fools
16         mens   mens login for facbook fools
17        login   mens login for facbook fools
18          for   mens login for facbook fools
19                mens login for facbook fools
20        fools   mens login for facbook fools
21        login          login for twetter boy
22          for          login for twetter boy
23      twetter          login for twetter boy
24          boy          login for twetter boy
25                            apples from cook
26         from               apples from cook
27         cook               apples from cook

Process finished with exit code 0

47 times faster than the original version. I see one additional trick to improve the performance on 1,000,000 lines of text: use a dictionary for the matched words. A good vocabulary size is about 20,000 words, and each line has about 10 words, so 10,000,000 / 20,000 = 500 repetitions on average for each word.
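The same caching idea can also be written with functools.lru_cache instead of a hand-rolled dictionary. A minimal sketch, equivalent in effect to the dictionary used in UPDATE #3 below:

from functools import lru_cache

from fuzzywuzzy import process

companies = ['walmart', 'microsoft', 'facebook', 'twitter', 'amazon', 'apple']


@lru_cache(maxsize=None)  # memoize one result per distinct word
def matches_company(word):
    # True if the word fuzzy-matches any company name with a score >= 90
    return bool(process.extractBests(word, companies, score_cutoff=90, limit=1))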

UPDATE #3: added a dictionary for the matched words

import pandas
from fuzzywuzzy import process
import time

lines = [
    'wlmart womens book set', 'microsoft fish sauce',
    'books from walmat store', 'mens login for facebook fools',
    'mens login for facbook fools', 'login for twetter boy',
    'apples from cook'
]
companies = ['walmart', 'microsoft', 'facebook', 'twitter', 'amazon', 'apple']

start_time = time.perf_counter()

keywords = []
search_words = []
dictionary = {}
for line in lines:
    for word in line.split():
        if word in dictionary:
            score = dictionary[word]
        else:
            match_score_list = process.extractBests(
                word, companies, score_cutoff=90, limit=1)
            score = bool(match_score_list)
            dictionary[word] = score
        keywords.append(line)
        search_words.append('' if score else word)
fuzzed_data_final = pandas.DataFrame(
    {'search_words': pandas.Series(search_words),
     'keywords': pandas.Series(keywords)})

total_time = time.perf_counter() - start_time
print("--- %f seconds ---" % total_time)
rows = 1
names = 2000
e = total_time / len(lines) / len(companies) * rows * 1000000. * names
h = e / 3600
d = h / 24
print('Time estimation for %d million rows and %d company names: %d seconds or'
      ' %d hours or %d days' % (rows, names, e, h, d))
print(fuzzed_data_final)

The output:

/usr/local/bin/python3.7 /Users/alex/PycharmProjects/game/pandas_doble_for_loop_v4.py
--- 0.005707 seconds ---
Time estimation for 1 million rows and 2000 company names: 271761 seconds or 75 hours or 3 days
   search_words                       keywords
0                       wlmart womens book set
1        womens         wlmart womens book set
2                       wlmart womens book set
3           set         wlmart womens book set
4                         microsoft fish sauce
5          fish           microsoft fish sauce
6         sauce           microsoft fish sauce
7         books        books from walmat store
8          from        books from walmat store
9                      books from walmat store
10        store        books from walmat store
11         mens  mens login for facebook fools
12        login  mens login for facebook fools
13          for  mens login for facebook fools
14               mens login for facebook fools
15        fools  mens login for facebook fools
16         mens   mens login for facbook fools
17        login   mens login for facbook fools
18          for   mens login for facbook fools
19                mens login for facbook fools
20        fools   mens login for facbook fools
21        login          login for twetter boy
22          for          login for twetter boy
23      twetter          login for twetter boy
24          boy          login for twetter boy
25                            apples from cook
26         from               apples from cook
27         cook               apples from cook

Process finished with exit code 0

It is 69 times faster than the original script. Can we make it 100?
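One candidate (speculation on my side, not benchmarked here): the per-word matching is CPU-bound and each word is independent, so the distinct words could be matched in parallel with multiprocessing. A minimal sketch, assuming the same companies list and score cutoff as above:

from multiprocessing import Pool

from fuzzywuzzy import process

companies = ['walmart', 'microsoft', 'facebook', 'twitter', 'amazon', 'apple']


def matches_company(word):
    # True if the word fuzzy-matches any company name with a score >= 90
    return bool(process.extractBests(word, companies, score_cutoff=90, limit=1))


if __name__ == '__main__':
    vocabulary = ['wlmart', 'womens', 'book', 'set']  # distinct words only
    with Pool() as pool:
        results = dict(zip(vocabulary, pool.map(matches_company, vocabulary)))
    print(results)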

Upvotes: 1
