Reputation: 177
I am using Beautiful Soup to identify a specific tag and its contents. The contents are HTML links, and I want to extract the text of these links.
The problem is that the text consists of different numbers that follow a specific pattern. I am only interested in numbers such as "61993J0417" and "61991CJ0316", and I need the regexp to match whether the number has a "J" or a "CJ" in the middle.
I have used this code to achieve this:
soup.find_all(text=re.compile('[6][1-2][0-9]{3}[J]|[CJ][0-9]{4}'))
The soup variable is the contents of the specific tag. This code works in 9 out of 10 cases. However, when I run this script on one of my source files, it also matches numbers such as "51987PC0716".
I cannot understand why, so I turn to you for assistance.
Upvotes: 3
Views: 3234
Reputation: 61467
You haven't specified what the | applies to; by default it's the entire regex, meaning you have asked for either [6][1-2][0-9]{3}[J] (which is the same thing as 6[12][0-9]{3}J) or [CJ][0-9]{4}. Note that [CJ] is a character class meaning "either C or J", not the literal sequence CJ, so the second alternative on its own matches the "C0716" in "51987PC0716". Use parentheses to specify what the alternatives are:
^6[12][0-9]{3}(J|CJ)[0-9]{4}$
which is better written
^6[12][0-9]{3}C?J[0-9]{4}$
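To see the precedence issue concretely, here is a minimal sketch using only the standard re module. The variable names original, fixed and samples are made up for illustration; the three sample strings are the ones quoted in the question. It prints which part of each string the question's pattern actually matched, next to whether the anchored pattern accepts the string.

```python
import re

# Pattern from the question: the top-level | splits the WHOLE regex in two.
original = re.compile(r'[6][1-2][0-9]{3}[J]|[CJ][0-9]{4}')
# Anchored pattern with an optional C before the mandatory J.
fixed = re.compile(r'^6[12][0-9]{3}C?J[0-9]{4}$')

# Sample strings taken from the question.
samples = ['61993J0417', '61991CJ0316', '51987PC0716']

for s in samples:
    m = original.search(s)
    matched_part = m.group(0) if m else None
    print(s, '| original matched:', matched_part, '| fixed accepts:', bool(fixed.search(s)))
```

For "51987PC0716" the original pattern matches through its second alternative, which is why the unwanted number slipped through.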
Upvotes: 3
Reputation: 20621
IIUC, you always have a "J" inside your string. Therefore, make it obligatory, and make the "C" optional, using a question mark. Something like:
re.compile('6[1-2][0-9]{3}C?J[0-9]{4}')
I have not tested this, but you probably can continue from here by yourself.
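A quick check of that pattern, as a sketch rather than a definitive test: the helper name is_case_number and the candidates list are hypothetical, not from the question. Because the pattern has no anchors, search() would also accept a longer string that merely contains such a number, so the sketch uses fullmatch() on the stripped text, assuming each link text should be exactly one number.

```python
import re

pattern = re.compile(r'6[1-2][0-9]{3}C?J[0-9]{4}')

def is_case_number(text):
    """Return True if text, ignoring surrounding whitespace, is exactly one number."""
    return pattern.fullmatch(text.strip()) is not None

# Hypothetical link texts, including one with stray whitespace and one sentence.
candidates = ['61993J0417', ' 61991CJ0316\n', '51987PC0716', 'see 61993J0417 and others']

matching = [c for c in candidates if is_case_number(c)]
print(matching)

# Unanchored search() still finds the number inside the longer sentence.
print(pattern.search('see 61993J0417 and others') is not None)
```

If the link texts can contain extra words around the number, search() may be the better choice; that depends on the source files.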
Upvotes: 3