Reputation: 2821
Consider the following text:
one="ambience: 5 comments:xxx food: 4 comments: xxxx service: 3
comments: xxx"
two="ambience: 5 comments:xxx food: comments: since nothing to eat
after 8 pm service: 4 comments: xxxx "
three="ambience: it is a 5 comments:xxx food: a 6 comments: since nothing to eat
after 8 pm service: a 4 comments: xxxx "
For string one:
re.findall(ur'(ambience|food|service)[\s\S]*?(\d)',one,re.UNICODE)
[('ambience', '5'), ('food', '4'), ('service', '3')]
For string two, the result is:
[('ambience', '5'), ('food', '8'), ('service', '4')]
Because this pattern simply takes the first digit that follows the category name, the result is misleading when a rating is skipped, intentionally or otherwise.
If a rating is missing, how do I get the regex to return NaN for it?
[('ambience', '5'), ('food', 'NaN'), ('service', '4')]
I also have a variant using look-ahead and look-behind assertions:
re.findall(ur'(?<=food)[\s]*:[^\d]*([\d[.|-|\/|-]+)[^\d]*(?=comment[s]*[\s]*:)',one,re.UNICODE)
Upvotes: 0
Views: 60
Reputation: 26667
A simple change to the regex does the trick:
(ambience|food|service):[^\d:]*(\d*)
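[^\d:]* matches any run of characters other than a : or a digit, so the search for a rating stops at the colon of the next "comments:" label and (\d*) captures an empty string when no digit comes before it.
Example matching: http://regex101.com/r/bM0gT2/1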
Example usage
>>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', one)
[('ambience', '5'), ('food', '4'), ('service', '3')]
>>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', two)
[('ambience', '5'), ('food', ''), ('service', '4')]
>>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', three)
[('ambience', '5'), ('food', '6'), ('service', '4')]
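The pattern above returns an empty string for a missing rating, whereas the question asks for NaN. A minimal sketch of one way to bridge that gap is to post-process the matches; the helper name extract_ratings and its missing parameter are hypothetical, and the sample string is written on a single line here:

```python
import re

# Same pattern as above: category name, colon, then any run of characters
# that are neither digits nor colons, then an optional run of digits.
RATING_RE = re.compile(r'(ambience|food|service):[^\d:]*(\d*)')

def extract_ratings(text, missing='NaN'):
    """Return (category, rating) pairs, substituting `missing` for empty ratings.

    Hypothetical helper for illustration; not part of the re module.
    """
    return [(name, digits or missing) for name, digits in RATING_RE.findall(text)]

two = ("ambience: 5 comments:xxx food: comments: since nothing to eat "
       "after 8 pm service: 4 comments: xxxx ")
print(extract_ratings(two))
# [('ambience', '5'), ('food', 'NaN'), ('service', '4')]
```

This still relies on a "comments:" label following each category: a digit that appears before the next colon would be taken as the rating.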
Upvotes: 1