Reputation: 35
I have a CSV database of tweets that I need to search for a list of specific phrases and words. For example, when searching for "global warming", I want to find not only "global warming" but also "Global warming", "Global Warming", "#globalwarming", "#Globalwarming", "#GlobalWarming", etc.; that is, all possible forms.
How could I use a regex in my code to do that? Or is there another solution?
    import csv

    with open('filedirectory.csv', 'w', newline='') as output_file:
        writer = csv.writer(output_file)
        with open('filedirectory1.csv', 'w', newline='') as output_file2:
            writer2 = csv.writer(output_file2)
            with open('filedirectory2.csv') as csv_file:
                csv_read = csv.reader(csv_file)
                for row in csv_read:
                    search_terms = ["global warming", "GLOBAL WARMING"]  # etc.
                    # row[2] is the column that holds the tweet text
                    if any([term in row[2] for term in search_terms]):
                        writer.writerow(row)
                    else:
                        writer2.writerow(row)
Upvotes: 1
Views: 305
Reputation: 444
You can keep your own code with one very simple modification: lowercase the tweet text before comparing it with lowercase search terms.
...
    for row in csv_read:
        # row is a list of columns, so lowercase the tweet text column only
        text_lower = row[2].lower()
        search_terms = ["global warming", "globalwarming"]
        if any(term in text_lower for term in search_terms):
            writer.writerow(row)
        else:
            writer2.writerow(row)
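As a quick check of the lowercasing approach, here is a minimal sketch run on a few made-up tweets. The helper name `matches_any` and the sample strings are hypothetical and not part of the question's data.

```python
def matches_any(text, search_terms):
    """Return True if any search term occurs in text, ignoring case."""
    text_lower = text.lower()
    return any(term in text_lower for term in search_terms)

# Terms are written in lowercase, with and without the space,
# so that hashtags such as "#GlobalWarming" are covered too.
search_terms = ["global warming", "globalwarming"]

# Made-up sample tweets, for illustration only.
sample_tweets = [
    "Global Warming is real",
    "So hot today #GlobalWarming",
    "Nothing to see here",
]

results = [matches_any(tweet, search_terms) for tweet in sample_tweets]
print(results)
```

The first two samples match and the third does not, so only two lowercase terms are needed to cover all the capitalisation and hashtag variants listed in the question.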
If you must use a regex, or you are worried about missing rows such as "...global(more than one space)warming...", "..global____warming.." or "..global serious warming..", you can allow any characters between the two words:
...
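    import re

    # Case-insensitive: "global", then as few characters as possible, then "warming"
    global_regex = re.compile(r'global.*?warming', re.IGNORECASE)
    for row in csv_read:
        # search the tweet text column; row itself is a list, not a string
        if global_regex.search(row[2]):
            writer.writerow(row)
        else:
            writer2.writerow(row)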
I compiled the regex outside the loop for better performance.
Here you can see the regex in action.
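To see what the pattern accepts, the sketch below runs it against a few made-up strings and compares it with a tighter variant. The `tight` pattern and the sample strings are assumptions added for illustration; they are not part of the answer above.

```python
import re

# Pattern from the answer: anything (except a newline) may sit between the words.
loose = re.compile(r'global.*?warming', re.IGNORECASE)
# Hypothetical tighter variant: only whitespace or underscores between the words.
tight = re.compile(r'global[\s_]*warming', re.IGNORECASE)

# Made-up sample strings, for illustration only.
samples = [
    "#GlobalWarming",
    "global    warming",
    "global____warming",
    "global serious warming",
    "the global market is warming up",
]

for text in samples:
    # Print the text and whether each pattern finds a match in it.
    print(text, bool(loose.search(text)), bool(tight.search(text)))
```

The loose pattern matches every sample, including the last one, which is probably not about global warming. If that kind of false positive matters for your data, a tighter pattern like the one sketched here may be a better fit, at the cost of missing forms such as "global serious warming".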
Upvotes: 1