Peter

Reputation: 441

Substring with two possibilities regex

I extracted one long string from a web page, then ran:

 x = re.findall(r"(?:l'article)\s\d+\w+.*;", xpath)

It returned the following two strings:

 l'article 1382 du code civil ;
 l'article 700 du code de procédure civile, les condamne à payer à la société Financière du cèdre la somme globale de 3 000 euros et rejette leurs demandes ;

However, the second one is too long. All I need is the text up to the ','. Is there a way to do this directly, so that my original regex stops at either the ';' or the ',', whichever it encounters first?

If not, can I apply a regex to a list, or do I need to write a loop for that?

Required outcome: a list with:

 l'article 1382 du code civil
 l'article 700 du code de procédure civile

Note: I have to apply this to many pages, and each page may contain many more of these. Doing anything by hand, or picking out a specific entry in a list, is not an option.

Upvotes: 2

Views: 123

Answers (2)

Sebastian Proske

Reputation: 8413

You can simplify your regex a lot:

  • (?:l'article) -> there is no need for the non-capturing group, so you could just remove it.
  • \s\d+\w+ -> the \w+ seems rather pointless: \w also matches digits, so it is satisfied by a number with no letters at all (and it only forces the number to be at least two characters long). I think you could remove it. Alternatively, you may be missing a whitespace character before it, to match e.g. 1382 du.
  • .*; -> to match anything up to , or ; you can simply use a negated character class such as [^;,]*, which matches every character that is not one of those.

So your final regex could be either

l'article\s\d+[^;,]*

or

l'article\s\d+\s\w+[^;,]*
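As a rough sketch of how the first pattern could be used from Python: the sample `text` below is an assumption built from the two strings in the question, and the variable names are only illustrative. Because the negated class stops before the delimiter, the only cleanup left is the space that precedes ' ;'.

```python
import re

# Sample input, assumed for illustration; it mirrors the strings in the question.
text = (
    "l'article 1382 du code civil ; "
    "l'article 700 du code de procédure civile, les condamne à payer "
    "à la société Financière du cèdre la somme globale de 3 000 euros "
    "et rejette leurs demandes ;"
)

# [^;,]* consumes characters until the first ';' or ',' (or the end of the string).
matches = re.findall(r"l'article\s\d+[^;,]*", text)

# The first match still ends with the space that precedes ' ;', so strip it.
articles = [m.strip() for m in matches]
print(articles)
```

re.findall already returns a list of every match in the string, so no explicit loop over the page is needed. One thing to keep in mind: unlike `.`, the class [^;,] also matches newlines, so a match can run across lines if no delimiter comes first.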

Upvotes: 2

Neil

Reputation: 14321

A couple of things: you seem to be missing the non-greedy (lazy) modifier, ?, which forces the regex to stop at the first occurrence. Additionally, you can match any one of several characters by using [] (see the Regex101 link below). Here is the new pattern:

(?:l'article)\s\d+\w+.*?[;,]

Regex101:

https://regex101.com/r/tYkNHK/1
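A minimal sketch of this pattern in Python, assuming the same two strings as in the question (the `text` sample and variable names are illustrative, not taken from the original page). Note that the lazy version includes the closing ';' or ',' in each match, so the sketch strips it afterwards to reach the required outcome.

```python
import re

# Sample input, assumed for illustration; it mirrors the strings in the question.
text = (
    "l'article 1382 du code civil ; "
    "l'article 700 du code de procédure civile, les condamne à payer "
    "à la société Financière du cèdre la somme globale de 3 000 euros "
    "et rejette leurs demandes ;"
)

# .*? expands as little as possible, so each match ends at the first ';' or ','.
matches = re.findall(r"(?:l'article)\s\d+\w+.*?[;,]", text)

# Each match ends with its delimiter; drop it, then drop any trailing space.
articles = [m.rstrip(";,").rstrip() for m in matches]
print(articles)
```

If stripping afterwards is undesirable, wrapping the part before [;,] in a capturing group would make re.findall return only that group.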

Upvotes: 3
