Reputation: 441
I extracted one long string from a webpage and ran the following on it:
x=re.findall(r"(?:l'article)\s\d+\w+.*;", xpath)
It extracted the following 2 strings:
l'article 1382 du code civil ;
l'article 700 du code de procédure civile, les condamne à payer à la société Financière du cèdre la somme globale de 3 000 euros et rejette leurs demandes ;
However, the second one is too long. All I need is the text up to the ','. Is there a way to do this directly, so that my original regex stops at either the ';' or the ',', whichever it encounters first?
If not, can I apply a regex to a list, or do I need to write a loop for that?
The required outcome is a list with:
l'article 1382 du code civil
l'article 700 du code de procédure civile
Note: I have to apply this to many pages, and each page may contain many more of these matches, so doing anything by hand or picking out a specific entry in a list is not an option.
Upvotes: 2
Views: 123
Reputation: 8413
You can simplify your regex a lot:
(?:l'article) -> there is no need for the non-capturing group, so you can just remove it.
\s\d+\w+ -> the check for \w+ seems rather pointless (especially as \w also matches digits, so it matches numbers without any letters), so I think you could remove it. Or you are missing a space character to match e.g. "1382 du".
.*; -> to match anything up to , or ; you can simply use a negated character class, like [^;,]*, which will match everything that's not one of those.
So your final regex could be either
l'article\s\d+[^;,]*
or
l'article\s\d+\s\w+[^;,]*
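As a minimal sketch of how the first pattern could be used from Python: the sample text below is assembled from the two matches quoted in the question (the real page text is not shown, so this is an assumption), and the names text, pattern and articles are only illustrative. Because the negated class stops right before the delimiter, the space that precedes " ;" is still part of the match, so each result is stripped.

```python
import re

# Sample text built from the two matches quoted in the question (assumed layout).
text = (
    "l'article 1382 du code civil ; "
    "l'article 700 du code de procédure civile, les condamne à payer "
    "à la société Financière du cèdre la somme globale de 3 000 euros "
    "et rejette leurs demandes ;"
)

# [^;,]* consumes everything up to, but not including, the first ';' or ','.
pattern = re.compile(r"l'article\s\d+[^;,]*")

# strip() removes the trailing space left in front of ' ;'.
articles = [match.strip() for match in pattern.findall(text)]
print(articles)
```

Since findall returns every match in the string, the same two lines should work unchanged on pages that contain many more of these references.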
Upvotes: 2
Reputation: 14321
A couple of things: you seem to be missing the non-greedy (lazy) modifier, ?, which forces the regex to stop at the first occurrence of whatever follows. Additionally, you can match any one of several characters by using a character class, []. Here is the new pattern:
(?:l'article)\s\d+\w+.*?[;,]
Regex101:
https://regex101.com/r/tYkNHK/1
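Note that the pattern above keeps the trailing ';' or ',' in each match. A possible variation, sketched here in Python, wraps the wanted part in a capturing group so that findall returns it without the delimiter. It also drops the \w+ part, which is an adaptation and not the exact pattern from this answer. The second half addresses the "apply regex to a list" part of the question with a list comprehension. The sample text and the names extract_articles, long_matches and trimmed are assumptions made for illustration.

```python
import re


def extract_articles(page_text):
    """Return each "l'article N ..." reference up to the first ';' or ','."""
    # .*? is lazy, so it stops at the first delimiter; with a single capturing
    # group, findall returns only the group, i.e. the text before the delimiter.
    # Note: '.' does not cross newlines unless re.DOTALL is passed.
    return re.findall(r"(l'article\s\d+.*?)\s*[;,]", page_text)


# Sample text built from the two matches quoted in the question (assumed layout).
text = (
    "l'article 1382 du code civil ; "
    "l'article 700 du code de procédure civile, les condamne à payer "
    "à la société Financière du cèdre la somme globale de 3 000 euros "
    "et rejette leurs demandes ;"
)
articles = extract_articles(text)

# Alternative: trim an already extracted list without an explicit loop body.
long_matches = [
    "l'article 1382 du code civil ;",
    "l'article 700 du code de procédure civile, les condamne à payer ;",
]
trimmed = [re.split(r"\s*[;,]", s, maxsplit=1)[0] for s in long_matches]

print(articles)
print(trimmed)
```

For many pages, extract_articles could simply be called once per page and the resulting lists concatenated.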
Upvotes: 3