Reputation: 1353
Here's the code:
import csv
import re

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('cities2.txt', 'rb') as cities, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter=",")
    reader = csv.reader(csv_f)
    city_lst = cities.readlines()
    for row in reader:
        for city in city_lst:
            city = city.strip()
            match = re.search((r'\b{0}\b').format(city), row[0])
            if match:
                writer.writerow(row)
                break
"alcohol_rehab_ltp.csv" has 145 lines, and "cities2.txt" has 18,895 lines (which becomes 18,895 when converted to the list). It takes a while for this process to run, I haven't timed but maybe around 5 minutes. Is there something simple (or more complex) that I am overlooking here, that could make this script run more quickly. I will be using other .csv files to run against the large text file of "cities.txt", and these csv files may have anywhere up to 1000 lines. Any ideas on how to speed things up would be appreciated! Here is csv file:Keywords (144),Avg. CPC,Local Searches,Advertiser Competition
[alcohol rehab san diego],$49.54,90,High
[alcohol rehab dallas],$86.48,110,High
[alcohol rehab atlanta],$60.93,50,High
[free alcohol rehab centers],$11.88,110,High
[christian alcohol rehab centers],–,70,High
[alcohol rehab las vegas],$33.40,70,High
[alcohol rehab cost],$57.37,110,High
Some lines from the text file:
san diego
dallas
atlanta
dallas
los angeles
denver
Upvotes: 1
Views: 95
Reputation: 180512
I think you can use a set and indexing:
import csv

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('cities2.txt', 'rb') as cities, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter=",")
    reader = csv.reader(csv_f)
    # make a set of all city names, lookups are O(1)
    city_set = {line.rstrip() for line in cities}
    output_list = []
    header = next(reader)  # skip header
    for row in reader:
        try:
            # city names are either first or last, with two words following
            # or preceding, so split twice on whitespace from either direction
            if row[0].split(None, 2)[-1].rstrip("]") in city_set or \
                    row[0].rsplit(None, 2)[0][1:] in city_set:
                output_list.append(row)
        except IndexError as e:
            print(e, row[0])
    writer.writerows(output_list)
Running time is now O(n) as opposed to quadratic.
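For illustration, here is how those two splits pull a candidate city out of a keyword (a minimal sketch using the sample data from the question; the city-first keyword is a hypothetical shape that the second check covers):
# city at the end: drop the first two words, strip the trailing "]"
row0 = "[alcohol rehab san diego]"
print(row0.split(None, 2)[-1].rstrip("]"))  # san diego

# city at the front: drop the last two words, strip the leading "["
row0 = "[san diego alcohol rehab]"
print(row0.rsplit(None, 2)[0][1:])          # san diego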
Upvotes: 2
Reputation: 846
First, as @Shawn Zhang suggests, (r'\b{0}\b').format(c.strip())
can be moved outside the loop, and you can build a result list to avoid writing to the file on each iteration.
Second, you might try re.compile
to precompile the regular expressions, which may improve their performance.
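A minimal sketch of that idea, with illustrative sample data standing in for the real files (names like patterns and rows are mine, not from the question):
import re

cities = ["san diego\n", "dallas\n", "atlanta\n"]           # stand-in city lines
rows = [["[alcohol rehab san diego]"], ["[alcohol rehab cost]"]]

# compile each pattern once instead of re-parsing it on every row
patterns = [re.compile(r'\b{0}\b'.format(c.strip())) for c in cities]

output_list = [row for row in rows
               if any(p.search(row[0]) for p in patterns)]
print(output_list)  # [['[alcohol rehab san diego]']]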
Third, try to profile it a little to find the bottleneck, e.g. with timeit
or another profiler such as cProfile.
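For example, cProfile from the standard library can be run without modifying the script (the script name here is a placeholder):
python -m cProfile -s cumulative your_script.py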
Also, if the city is always in the first column (and, I assume, the column is named 'City'), why don't you use csv.DictReader()
to read the csv? I'm sure it's faster than a regular expression.
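A rough sketch of the DictReader idea; note that in the sample file from the question the first column is actually named 'Keywords (144)', not 'City':
import csv

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f:
    reader = csv.DictReader(csv_f)       # first line becomes the field names
    for row in reader:
        keyword = row['Keywords (144)']  # e.g. "[alcohol rehab san diego]"
        # ... match city names against keyword here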
Since you provided an example of your file, I got rid of re
(it seems you really don't need it) and got more than 10 times faster with the code below:
import csv

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('cities2.txt', 'rb') as cities, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter=",")
    output_list = []
    reader = csv.reader(csv_f)
    city_lst = cities.readlines()
    for row in reader:
        for city in city_lst:
            city = city.strip()
            if city in row[0]:
                output_list.append(row)
                break  # avoid appending the same row twice if several cities match
    writer.writerows(output_list)
Upvotes: 2
Reputation: 74675
Build a single regex with all the city names:
city_re = re.compile(r'\b('+ '|'.join(c.strip() for c in cities.readlines()) + r')\b')
and then do:
for row in reader:
    match = city_re.search(row[0])
    if match:
        writer.writerow(row)
This will reduce the number of Python-level loop iterations from 18,895 x 145 to only 145, with the regex engine doing its best at string prefix matching across those 18,895 city names.
For your convenience and testing, here is the full listing:
import csv
import re

with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('cities2.txt', 'rb') as cities, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter=",")
    reader = csv.reader(csv_f)
    city_re = re.compile(r'\b(' + '|'.join(c.strip() for c in cities.readlines()) + r')\b')
    for row in reader:
        match = city_re.search(row[0])
        if match:
            writer.writerow(row)
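One refinement worth considering if you also need to know which city matched (via match.group(1)): regex alternation is tried left to right, so a short name like "san" could shadow "san diego". Sorting the names longest-first, and escaping them in case any contain regex metacharacters, avoids that. A sketch under those assumptions:
# longest names first, so "san diego" wins over a shorter "san"
names = sorted((c.strip() for c in cities.readlines()), key=len, reverse=True)
city_re = re.compile(r'\b(' + '|'.join(re.escape(n) for n in names) + r')\b')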
Upvotes: 2
Reputation: 3170
Please note: I assume there is a better way than re.search
for finding the city in a row, since the city will generally be separated by a delimiter such as a space. Otherwise the complexity is greater than O(n*m).
One way could be to use a hash table (note that Python's hash() returns large integers, so the index has to be reduced modulo the table size, which also means collisions can produce occasional false positives).
MAX = 2 ** 20  # table size
ht = [0] * MAX
Read all the cities (assuming these are in the thousands) and fill up the hash table:
for city in cities:
    ht[hash(city.strip()) % MAX] = 1
Now, when you iterate across each row in reader:
for row in reader:
    for word in row:
        if ht[hash(word) % MAX] == 1:
            # found, do stuff here
            pass
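In Python, the built-in set is already a hash table with collision handling, so the same idea can be written directly and with exact membership tests (a sketch mirroring the pseudocode above; multi-word cities like "san diego" would need extra handling in either version):
# exact lookups, no false positives from collisions
city_set = {line.strip() for line in cities}

for row in reader:
    for word in row:
        if word in city_set:
            # found, do stuff here
            pass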
Upvotes: 1
Reputation: 1884
Even though I don't think the loop/IO is the big bottleneck, you can still start with it. There are two tips I can provide:
(r'\b{0}\b').format(c.strip())
can be moved outside the loop; that will improve performance a bit, because we don't have to call strip() and format() on each iteration.
Also, you don't have to write the output on each iteration. Instead, you can create a result list, output_list,
collect the results during the loop, and write them out once after it.
import csv
import re
import datetime

start = datetime.datetime.now()
with open('alcohol_rehab_ltp.csv', 'rb') as csv_f, \
        open('cities2.txt', 'rb') as cities, \
        open('drug_rehab_city_state.csv', 'wb') as out_csv:
    writer = csv.writer(out_csv, delimiter=",")
    reader = csv.reader(csv_f)
    # strip and format each city once, before the row loop
    city_lst = [(r'\b{0}\b').format(c.strip()) for c in cities.readlines()]
    output_list = []
    for row in reader:
        for city in city_lst:
            match = re.search(city, row[0])
            if match:
                output_list.append(row)
                break
    writer.writerows(output_list)
end = datetime.datetime.now()
print end - start
Upvotes: 1