user3314418

One way to grab both text and percentages: Python regex to distinguish numbers from letters within parentheses

I have this:

Bbc World News (57%); DANONE SA (FRANCE) (52%), Mn-Public-Radio-Intl; SIC123 Industry (52%)

I'd like to get:

[Bbc World News, 57], [DANONE SA (FRANCE), 52], [Mn-Public-Radio-Intl, 0], [SIC123 Industry, 52]

With the following pattern, helpfully suggested by Martijn Pieters, I can get everything except DANONE SA (FRANCE). I'm not sure how to distinguish between (FRANCE) and (52%).

import re
import numpy as np

pat = r'(\b[\w\s\d!////&,:.%#@$-]+\b)(?:\s+\((\d+)%\))?'
[(name, int(perc) if perc else np.nan)
 for name, perc in re.findall(pat, inputtext)]

Upvotes: 0

Answers (1)

Martijn Pieters

You can include the ( and ) characters in the character class, but then the class will also match the first characters of the percentage text (so "(57" in the case of "Bbc World News (57%)"). To make this work, you need a look-ahead that matches on the trailing , or ; or the end of the string:

re.findall(r'(\b[\w() -]+)(?:\s+\((\d+)%\))?(?=[,;]|$)', inputtext)

The (?=...) is a look-ahead assertion; the match is now anchored to a location that is followed either by a character matching the [,;] class or by the end of the string. That means the part before it, which matches an optional (..%) percentage amount, can only match directly before a comma, a semicolon, or the end of the text, and that in turn limits what the name group before it can match.
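To see the effect of the look-ahead in isolation, here is a small comparison sketch that runs the same pattern with and without the (?=[,;]|$) part. The names sample, without_lookahead, with_lookahead, greedy and anchored are only illustrative, and the sample is just the first two entries of the input from the question.

```python
import re

# First two entries of the question's input.
sample = 'Bbc World News (57%); DANONE SA (FRANCE) (52%)'

# Same pattern, once without and once with the trailing look-ahead.
without_lookahead = r'(\b[\w() -]+)(?:\s+\((\d+)%\))?'
with_lookahead = without_lookahead + r'(?=[,;]|$)'

# Without the look-ahead, the greedy character class swallows "(57" and "(52",
# so the optional percentage group never gets a chance to match.
greedy = re.findall(without_lookahead, sample)
print(greedy)
# [('Bbc World News (57', ''), ('DANONE SA (FRANCE) (52', '')]

# With the look-ahead, the match must end right before , or ; or the end of
# the string, which forces the class to backtrack and leave "(57%)" to the
# percentage group.
anchored = re.findall(with_lookahead, sample)
print(anchored)
# [('Bbc World News', '57'), ('DANONE SA (FRANCE)', '52')]
```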

Demo:

>>> import re
>>> import numpy as np
>>> inputtext = 'Bbc World News (57%); DANONE SA (FRANCE) (52%), Mn-Public-Radio-Intl; SIC123 Industry (52%)'
>>> [(name, int(perc) if perc else np.nan)
...  for name, perc in re.findall(r'(\b[\w() -]+)(?:\s+\((\d+)%\))?(?=[,;]|$)', inputtext)]
[('Bbc World News', 57), ('DANONE SA (FRANCE)', 52), ('Mn-Public-Radio-Intl', nan), ('SIC123 Industry', 52)]
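
The question asked for 0 rather than nan when an entry has no percentage. One possible way to package the pattern for that, as a sketch only: parse_entries, ENTRY_PATTERN and the default parameter are hypothetical names introduced here, not part of the original code.

```python
import re

# Same pattern as above, compiled once for reuse.
ENTRY_PATTERN = re.compile(r'(\b[\w() -]+)(?:\s+\((\d+)%\))?(?=[,;]|$)')


def parse_entries(text, default=0):
    """Return (name, percentage) pairs; entries without a percentage get `default`."""
    return [(name, int(perc) if perc else default)
            for name, perc in ENTRY_PATTERN.findall(text)]


inputtext = ('Bbc World News (57%); DANONE SA (FRANCE) (52%), '
             'Mn-Public-Radio-Intl; SIC123 Industry (52%)')
print(parse_entries(inputtext))
# [('Bbc World News', 57), ('DANONE SA (FRANCE)', 52), ('Mn-Public-Radio-Intl', 0), ('SIC123 Industry', 52)]
```

Using a plain integer default avoids the numpy dependency; passing default=float('nan') should give output equivalent to the demo above.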

Upvotes: 2
