DACW

Reputation: 2821

Regex capture numbers based on preceding text

Consider the following text:

    one="ambience: 5 comments:xxx food: 4 comments: xxxx service: 3 
    comments: xxx" 

    two="ambience: 5 comments:xxx food:   comments: since nothing to eat
    after 8 pm service: 4  comments: xxxx "

    three="ambience: it is a 5 comments:xxx food: a 6   comments: since nothing to eat
    after 8 pm service: a 4  comments: xxxx "

For string one:

    re.findall(ur'(ambience|food|service)[\s\S]*?(\d)',one,re.UNICODE)
    [('ambience', '5'), ('food', '4'), ('service', '3')]

For string two the result is:

    [('ambience', '5'), ('food', '8'), ('service', '4')]

Since this logic simply looks for the first digit after the label, it gives misleading results when a rating is skipped, intentionally or otherwise: here the 8 from "after 8 pm" is reported as the food rating.

If a rating is missing, how do I get the regex to return it as NaN?

    [('ambience', '5'), ('food', 'NaN'), ('service', '4')]

I also have a variant using look-ahead and look-behind assertions:

    re.findall(ur'(?<=food)[\s]*:[^\d]*([\d[.|-|\/|-]+)[^\d]*(?=comment[s]*[\s]*:)',one,re.UNICODE)

Upvotes: 0

Views: 60

Answers (1)

nu11p01n73R

Reputation: 26667

A simple change to the regex would do the trick:

(ambience|food|service):[^\d:]*(\d*)
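  • [^\d:]* matches any run of characters other than a : or a digit, so the match cannot run past the colon of the next label
  • (\d*) captures the digits that follow, or an empty string if there are none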
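
To see why this works, it can help to print what each match actually consumes. The snippet below is a minimal sketch in Python 3; the shortened `text` string and the `matches` name are made up for illustration and are not from the question.

```python
import re

# Pattern from the answer above.
pattern = re.compile(r'(ambience|food|service):[^\d:]*(\d*)')

# Illustrative input (an assumption): a skipped food rating followed by a
# stray digit in the comment text.
text = "food:   comments: since nothing to eat after 8 pm service: 4"

# Collect the full matched text and the captured rating for each match.
matches = [(m.group(0), m.group(2)) for m in pattern.finditer(text)]

for whole, rating in matches:
    print(repr(whole), '->', repr(rating))
# 'food:   comments' -> ''
# 'service: 4' -> '4'
```

The food match stops at the colon after "comments", so the 8 further along is never reached. Note that a digit appearing before that next colon would still be captured, so this presumably relies on a "comments:" label following each rating.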

Example match: http://regex101.com/r/bM0gT2/1

Example usage

>>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', one)
[('ambience', '5'), ('food', '4'), ('service', '3')]
>>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', two)
[('ambience', '5'), ('food', ''), ('service', '4')]
>>> re.findall(r'(ambience|food|service):[^\d:]*(\d*)', three)
[('ambience', '5'), ('food', '6'), ('service', '4')]
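
The empty string for food is not quite the 'NaN' the question asks for. One possible way to get there, as a sketch rather than the only option, is to post-process the findall result in Python 3. `extract_ratings` is a hypothetical helper name, not part of the `re` module.

```python
import re

# Pattern from the answer: label, colon, a run of non-digit/non-colon
# characters, then zero or more digits.
PATTERN = re.compile(r'(ambience|food|service):[^\d:]*(\d*)')

def extract_ratings(text):
    """Return (label, rating) pairs, substituting 'NaN' for an empty capture."""
    return [(label, rating or 'NaN') for label, rating in PATTERN.findall(text)]

# String two from the question, written as a valid Python literal.
two = ("ambience: 5 comments:xxx food:   comments: since nothing to eat\n"
       "after 8 pm service: 4  comments: xxxx ")

print(extract_ratings(two))
# [('ambience', '5'), ('food', 'NaN'), ('service', '4')]
```

If a numeric result is preferred, `float(rating) if rating else float('nan')` could be used in place of `rating or 'NaN'`.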

Upvotes: 1
