Reputation: 1353
Here's the code:
import csv
import re

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('cities2.txt', 'rb') as cities, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter=",")
    reader = csv.reader(csv_f)
    city_lst = cities.readlines()
    for row in reader:
        for city in city_lst:
            city = city.strip()
            match = re.search((r'\b{0}\b').format(city), row[0])
            if match:
                writer.writerow(row)
                break
"alcohol_rehab_ltp.csv" has 145 lines, and "cities2.txt" has 18,895 lines (which becomes 18,895 when converted to the list). It takes a while for this process to run, I haven't timed but maybe around 5 minutes. Is there something simple (or more complex) that I am overlooking here, that could make this script run more quickly. I will be using other .csv files to run against the large text file of "cities.txt", and these csv files may have anywhere up to 1000 lines. Any ideas on how to speed things up would be appreciated! Here is csv file:Keywords (144),Avg. CPC,Local Searches,Advertiser Competition
[alcohol rehab san diego],$49.54,90,High
[alcohol rehab dallas],$86.48,110,High
[alcohol rehab atlanta],$60.93,50,High
[free alcohol rehab centers],$11.88,110,High
[christian alcohol rehab centers],–,70,High
[alcohol rehab las vegas],$33.40,70,High
[alcohol rehab cost],$57.37,110,High
Some lines from the text file:
san diego
dallas
atlanta
dallas
los angeles
denver
Upvotes: 1
Views: 95
Reputation: 180512
I think you can use a set and indexing:
import csv

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('cities2.txt', 'rb') as cities, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter=",")
    reader = csv.reader(csv_f)
    # make a set of all city names, lookups are O(1)
    city_set = {line.rstrip() for line in cities}
    output_list = []
    header = next(reader)  # skip header
    for row in reader:
        try:
            # city names are either first or last, with two words following
            # or preceding, so split twice on whitespace from either direction
            if row[0].split(None, 2)[-1].rstrip("]") in city_set or \
                    row[0].rsplit(None, 2)[0][1:] in city_set:
                output_list.append(row)
        except IndexError as e:
            print(e, row[0])
    writer.writerows(output_list)
Running time is now O(n) as opposed to quadratic.
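For illustration, here is how those two splits pull a candidate city out of a keyword (a minimal sketch using the sample data from the question; the city-first keyword is a hypothetical shape that the second check covers):
# city at the end: drop the first two words, strip the trailing "]"
row0 = "[alcohol rehab san diego]"
print(row0.split(None, 2)[-1].rstrip("]"))  # san diego

# city at the front: drop the last two words, strip the leading "["
row0 = "[san diego alcohol rehab]"
print(row0.rsplit(None, 2)[0][1:])          # san diego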
Upvotes: 2
Reputation: 846
First, as @Shawn Zhang suggests, (r'\b{0}\b').format(c.strip())
can be moved outside the loop, and you can build a result list to avoid writing to the file on each iteration.
Second, you might try re.compile
to precompile the regular expressions, which may improve their performance.
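A minimal sketch of that idea, with illustrative sample data standing in for the real files (names like patterns and rows are mine, not from the question):
import re

cities = ["san diego\n", "dallas\n", "atlanta\n"]           # stand-in city lines
rows = [["[alcohol rehab san diego]"], ["[alcohol rehab cost]"]]

# compile each pattern once instead of re-parsing it on every row
patterns = [re.compile(r'\b{0}\b'.format(c.strip())) for c in cities]

output_list = [row for row in rows
               if any(p.search(row[0]) for p in patterns)]
print(output_list)  # [['[alcohol rehab san diego]']]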
Third, try to profile it a little to find the bottleneck, e.g. with timeit
or another profiler such as cProfile.
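For example, cProfile from the standard library can be run without modifying the script (the script name here is a placeholder):
python -m cProfile -s cumulative your_script.py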
Also, if the city is always in the first column (and, I assume, the column is named 'City'), why don't you use csv.DictReader()
to read the csv? I'm sure it's faster than a regular expression.
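A rough sketch of the DictReader idea; note that in the sample file from the question the first column is actually named 'Keywords (144)', not 'City':
import csv

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f:
    reader = csv.DictReader(csv_f)       # first line becomes the field names
    for row in reader:
        keyword = row['Keywords (144)']  # e.g. "[alcohol rehab san diego]"
        # ... match city names against keyword here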
Since you provided an example of your file, I got rid of re
(it seems you really don't need it) and got more than 10 times faster with the code below:
import csv

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('cities2.txt', 'rb') as cities, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter=",")
    output_list = []
    reader = csv.reader(csv_f)
    city_lst = cities.readlines()
    for row in reader:
        for city in city_lst:
            city = city.strip()
            if city in row[0]:
                output_list.append(row)
                break  # avoid appending the same row twice if several cities match
    writer.writerows(output_list)
Upvotes: 2
Reputation: 74675
Build a single regex with all the city names:
city_re = re.compile(r'\b('+ '|'.join(c.strip() for c in cities.readlines()) + r')\b')
and then do:
for row in reader:
    match = city_re.search(row[0])
    if match:
        writer.writerow(row)
This will reduce the number of Python-level loop iterations from 18,895 x 145 to only 145, with the regex engine doing its best at string prefix matching across those 18,895 city names.
For your convenience and testing, here is the full listing:
import csv
import re

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('cities2.txt', 'rb') as cities, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter=",")
    reader = csv.reader(csv_f)
    city_re = re.compile(r'\b(' + '|'.join(c.strip() for c in cities.readlines()) + r')\b')
    for row in reader:
        match = city_re.search(row[0])
        if match:
            writer.writerow(row)
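One refinement worth considering if you also need to know which city matched (via match.group(1)): regex alternation is tried left to right, so a short name like "san" could shadow "san diego". Sorting the names longest-first, and escaping them in case any contain regex metacharacters, avoids that. A sketch under those assumptions:
# longest names first, so "san diego" wins over a shorter "san"
names = sorted((c.strip() for c in cities.readlines()), key=len, reverse=True)
city_re = re.compile(r'\b(' + '|'.join(re.escape(n) for n in names) + r')\b')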
Upvotes: 2
Reputation: 3170
Please note: I assume there is a better way than re.search
for finding the city in a row, since the city will generally be separated by a delimiter such as a space. Otherwise the complexity is greater than O(n*m).
One way could be to use a hash table (note that Python's hash() returns large integers, so the index has to be reduced modulo the table size, which also means collisions can produce occasional false positives).
MAX = 2 ** 20  # table size
ht = [0] * MAX
Read all the cities (assuming these are in the thousands) and fill up the hash table:
for city in cities:
    ht[hash(city.strip()) % MAX] = 1
Now, when you iterate across each row in reader:
for row in reader:
    for word in row:
        if ht[hash(word) % MAX] == 1:
            # found, do stuff here
            pass
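In Python, the built-in set is already a hash table with collision handling, so the same idea can be written directly and with exact membership tests (a sketch mirroring the pseudocode above; multi-word cities like "san diego" would need extra handling in either version):
# exact lookups, no false positives from collisions
city_set = {line.strip() for line in cities}

for row in reader:
    for word in row:
        if word in city_set:
            # found, do stuff here
            pass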
Upvotes: 1
Reputation: 1884
Even though I don't think the loop/IO is the big bottleneck, you can still start with it. There are two tips I can provide:
(r'\b{0}\b').format(c.strip())
can be moved outside the loop; that will improve performance a bit, because we don't have to call strip() and format() on each iteration.
Also, you don't have to write the output on each iteration. Instead, you can create a result list, output_list,
collect the results during the loop, and write them out once after it.
import csv
import re
import datetime

start = datetime.datetime.now()
with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('cities2.txt', 'rb') as cities, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter=",")
    reader = csv.reader(csv_f)
    # strip and format each city once, before the row loop
    city_lst = [(r'\b{0}\b').format(c.strip()) for c in cities.readlines()]
    output_list = []
    for row in reader:
        for city in city_lst:
            match = re.search(city, row[0])
            if match:
                output_list.append(row)
                break
    writer.writerows(output_list)
end = datetime.datetime.now()
print end - start
Upvotes: 1