Reputation: 588
I have a list of keywords, and I want to search a list of long strings for each keyword, any price in currency format, and any other number in the string that is less than 10. For example:
keywords = ['Turin', 'Milan' , 'Nevada']
strings = ['This is a sentence about Turin with 5 and $10.00 in it.', ' 2.5 Milan is a city with £1,000 in it.', 'Nevada and $1,100,000. and 10.09']
would hopefully return the following:
final_list = [('Turin', '$10.00', '5'), ('Milan', '£1,000', '2.5'), ('Nevada', '$1,100,000', '')]
I've got the following function with functioning regexes but I don't know how to combine the outputs into a list of tuples. Is there an easier way to achieve this? Should I split by word then look for matches?
def find_keyword_comments(list_of_strings,keywords_a):
    list_of_tuples = []
    for string in list_of_strings:
        keywords = '|'.join(keywords_a)
        keyword_rx = re.findall(r"^\b({})\b$".format(keywords), string, re.I)
        price_rx = re.findall(r'^[\$\£\€]\s?\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{1,2})?$', string)
        number_rx1 = re.findall(r'\b\d[.]\d{1,2}\b', string)
        number_rx2 = re.findall(r'\s\d\s', string)
Upvotes: 0
Views: 379
Reputation: 71451
You can use re.findall:
import re
keywords = ['Turin', 'Milan' , 'Nevada']
strings = ['This is a sentence about Turin with 5 and $10.00 in it.', '2.5 Milan is a city with £1,000 in it.', 'Nevada and $1,100,000. and 10.09']
grouped_strings = [(i, [b for b in strings if i in b]) for i in keywords]
new_groups = [(a, filter(lambda x: re.findall(r'\d', x), [re.findall(r'[\$\d\.£,]+', c) for c in b][0])) for a, b in grouped_strings]
last_groups = [(a, list(filter(lambda x: re.findall(r'\d', x) and float(x) < 10 if x[0].isdigit() else True, b))) for a, b in new_groups]
Output:
[('Turin', ['5', '$10.00']), ('Milan', ['2.5', '£1,000']), ('Nevada', ['$1,100,000.'])]
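If you want the exact tuple shape from the question, (keyword, price, small number), something along these lines might work. It is only a sketch under a few assumptions of mine: each string holds at most one keyword and one price, thousands are separated by commas and decimals by a dot, and an empty string stands in for a missing value. The names PRICE_RX and NUMBER_RX are just illustrative. Note that the `^` and `$` anchors in the question's patterns would make them match only when the whole string is a keyword or a price, so they are dropped here.

```python
import re

# Assumed price format: currency symbol, comma-separated thousands, optional decimals.
PRICE_RX = re.compile(r'[$£€]\s?\d{1,3}(?:,\d{3})*(?:\.\d{1,2})?')
# Plain integers or decimals; applied only after prices have been removed.
NUMBER_RX = re.compile(r'\d+(?:\.\d+)?')


def find_keyword_comments(list_of_strings, keywords_a):
    # One alternation for all keywords, escaped in case they contain regex characters.
    keyword_rx = re.compile(
        r'\b(?:{})\b'.format('|'.join(map(re.escape, keywords_a))), re.I)
    list_of_tuples = []
    for string in list_of_strings:
        keyword = keyword_rx.search(string)
        if keyword is None:
            continue  # no keyword in this string, skip it
        price = PRICE_RX.search(string)
        # Blank out prices so their digits are not picked up as plain numbers.
        rest = PRICE_RX.sub(' ', string)
        # First remaining number below 10, or '' if there is none.
        small = next((n for n in NUMBER_RX.findall(rest) if float(n) < 10), '')
        list_of_tuples.append(
            (keyword.group(0), price.group(0) if price else '', small))
    return list_of_tuples


keywords = ['Turin', 'Milan', 'Nevada']
strings = ['This is a sentence about Turin with 5 and $10.00 in it.',
           ' 2.5 Milan is a city with £1,000 in it.',
           'Nevada and $1,100,000. and 10.09']
print(find_keyword_comments(strings, keywords))
```

Removing the price before looking for small numbers avoids needing lookarounds to tell the "10" in "$10.00" apart from a standalone number.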
Upvotes: 3