Reputation: 1318
I am writing a Python script that is passing over upwards of 20GB of news article data. For every single line that is a "date" (once every 100 lines or so), I need to check if the title of that article is financial. The title is in the form:
SOME BIG NEWS HAPPENED TO CISCO
My code loops through every company name in the S&P 500 (which I have cached in a set), and tries to see if the title matches.
import re

line = "SOME BIG NEWS HAPPENED TO CISCO"
for company in company_names:
    pattern = re.compile("(\\b" + company + "\\b)", flags=re.IGNORECASE)
    if re.search(pattern, line):
        do_something()
I copied over a mere 100,000 lines to a separate file to test my program, and it took 347 seconds. At this rate, it won't get through all of my data for upwards of a week.
I am trying to figure out how it could possibly take so long to loop through my file. Is the problem that Python is unable to cache all of the compiled DFAs and needs to instead construct ~500 each time I encounter a new article?
Or is there another problem with my current regular expression that would cause such a long execution time?
Any help would be greatly appreciated.
Upvotes: 2
Views: 409
Reputation: 104102
You might try holding precompiled patterns in a dict. Something like:
import re

companies = ('Cisco', 'Apple', 'IBM', 'GE')
# Compile each pattern once, up front, instead of once per title
patterns = {co: re.compile("(\\b" + co + "\\b)", flags=re.IGNORECASE) for co in companies}
line = "SOME BIG NEWS HAPPENED TO CISCO"
for co, pat in patterns.items():
    if pat.search(line):
        print("'{}' found in: '{}'".format(co, line))
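A further variation on precompiling, not part of the suggestion above: the names can be joined into one alternation so each title is scanned once rather than once per company. This is only a sketch. The four names are a stand-in for the real S&P 500 set, and find_companies and the use of re.escape are my own assumptions.

```python
import re

# Hypothetical sample; the real input would be the ~500 cached names.
companies = ("Cisco", "Apple", "IBM", "GE")

# Escape each name so punctuation in a name is matched literally, and put
# longer names first so they are tried before any shorter name they contain.
alternatives = sorted((re.escape(co) for co in companies), key=len, reverse=True)
combined = re.compile(r"\b(" + "|".join(alternatives) + r")\b", flags=re.IGNORECASE)

def find_companies(line):
    """Return the distinct company names found in line, lowercased and sorted."""
    return sorted({m.group(1).lower() for m in combined.finditer(line)})

print(find_companies("SOME BIG NEWS HAPPENED TO CISCO"))  # ['cisco']
```

Whether this beats a dict of separate patterns on the real data is something to measure rather than assume.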
Or, you might try Python's string methods:
words = line.lower().split()
for co in [e.lower() for e in companies]:
    if co in words:
        print("'{}' found in: '{}'".format(co, line))
Note that applying [e.strip(',.!:;') for e in line.lower().split()] to the line is nearly equivalent to using word boundaries and case-insensitive matching in a regex. (Or use TigerhawkT3's ''.join(filter(str.isalpha, line.lower())).split() as the word list.)
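To make the "nearly" concrete, here is a small sketch comparing the two approaches on the same text. The helper names regex_hit and token_hit are hypothetical, and the possessive headline is a made-up example.

```python
import re

def regex_hit(name, text):
    # Word-boundary, case-insensitive search for a single name.
    return re.search(r"\b" + re.escape(name) + r"\b", text, flags=re.IGNORECASE) is not None

def token_hit(name, text):
    # Lowercase, split on whitespace, strip the listed punctuation from each word.
    tokens = [e.strip(',.!:;') for e in text.lower().split()]
    return name.lower() in tokens

headline = "Apple acquires Cisco: Generally a good thing"
print(regex_hit("Cisco", headline), token_hit("Cisco", headline))  # True True

# The two disagree on a possessive, because the apostrophe is not stripped.
possessive = "Cisco's earnings rise"
print(regex_hit("Cisco", possessive), token_hit("Cisco", possessive))  # True False
```

Token matching also cannot find names that contain a space, since split() breaks them into separate words.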
You can also use a set intersection to get common words:
>>> line2="Apple acquires Cisco: Generally a good thing"
>>> set(e.lower() for e in companies) & set(e.strip(',.!:;') for e in line2.lower().split())
set(['cisco', 'apple'])
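The intersection idea can be wrapped in a generator so a large file is read one line at a time. This is a sketch under assumptions: scan is a hypothetical name, the io.StringIO object stands in for the real file, and I do not know the actual layout of the date and title lines, so every line is treated as a title here.

```python
import io

# Lowercased once, up front; a stand-in for the real S&P 500 set.
company_names = {"cisco", "apple", "ibm", "ge"}

def scan(lines, names):
    """Yield (line_number, sorted matches) for each line that names a company."""
    for number, line in enumerate(lines, start=1):
        tokens = {w.strip(',.!:;') for w in line.lower().split()}
        hits = names & tokens
        if hits:
            yield number, sorted(hits)

sample = io.StringIO(
    "SOME BIG NEWS HAPPENED TO CISCO\n"
    "Weather stays mild this week\n"
    "Apple acquires Cisco: Generally a good thing\n"
)
for number, hits in scan(sample, company_names):
    print(number, hits)
```

Each line costs one split and one set intersection, independent of how many company names there are. Names made of several words would need separate handling.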
Upvotes: 3