Martin S.

Reputation: 13

RegEx: Python (findall). Order of elements in OR statement resulting in different output

I am trying to get my head around regular expressions and have been playing with some examples to see what they produce. I am struggling to understand how the order of the elements in an OR (|) affects the output of the following code:


import re 
uni = "University of Sheffield" 
first = re.findall(".*U|S.*U|S",uni) 
second = re.findall(".*U|S.*S|U",uni)
third = re.findall(".*S|U.*S|U",uni)

If I print the first, second and third variables, I get the following:

first -> ['U', 'S']
second -> ['U']
third -> ['University of S']

I don't understand why the output for each is the way it is. I assumed it would be the same in all three cases, namely ['University of S']. Could someone help me understand why each of these 3 cases is interpreted differently?

Thank you!

Upvotes: -1

Views: 69

Answers (2)

Finn E

Reputation: 388

It has to do with the order of operations involving OR (|).

The | operator has the lowest precedence in a regular expression, so it splits the whole pattern into alternatives. Your 3 expressions are therefore read as follows:
.*U OR S.*U OR S
.*U OR S.*S OR U
.*S OR U.*S OR U

findall scans the string from left to right and, at each position, tries the alternatives in the order they are written, taking the first one that succeeds. For the first expression, .*U matches at the start of the string: .* matches nothing and U matches the U. No U remains after that, so .*U and S.*U both fail for the rest of the string, but the third alternative (S) matches the S in Sheffield. Hence the result, ["U", "S"].

Similarly, for the second expression, .*U matches the leading U. After that there is no second U, so .*U and U both fail, and S.*S fails because only one uppercase S follows. Hence the result, ["U"].

For the third expression, the first alternative (.*S) already matches at the start of the string: .* greedily consumes 'University of ' and S matches the S in Sheffield. The alternatives U.*S and U are never needed, and nothing in the rest of the string (heffield) matches. Hence the result, ["University of S"].
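To check which alternative actually produced each match, one option is to wrap every alternative in a named group and read m.lastgroup. This is only a sketch: the group names a, b and c and the helper branches are made up for illustration and are not part of the original code.

```python
import re

uni = "University of Sheffield"

# Hypothetical labels: a, b, c mark the 1st, 2nd and 3rd alternative.
patterns = {
    "first": r"(?P<a>.*U)|(?P<b>S.*U)|(?P<c>S)",
    "second": r"(?P<a>.*U)|(?P<b>S.*S)|(?P<c>U)",
    "third": r"(?P<a>.*S)|(?P<b>U.*S)|(?P<c>U)",
}

def branches(pattern, text):
    """Return (matched text, name of the alternative that matched) pairs."""
    return [(m.group(0), m.lastgroup) for m in re.finditer(pattern, text)]

for name, pattern in patterns.items():
    print(name, branches(pattern, uni))
# first [('U', 'a'), ('S', 'c')]
# second [('U', 'a')]
# third [('University of S', 'a')]
```

re.finditer is used here instead of re.findall because findall would return the group contents rather than the match objects once the pattern contains groups.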

I assume you meant your expression to be:
.* (U OR S) .* (U OR S)

To write this as valid regex, it should be:

.*(?:U|S).*(?:U|S)

You can also do it with capturing groups (...) instead of non-capturing groups (?:...), but note that findall then returns the contents of the groups rather than the whole match.

However, best practice in this case would be to use a character class. It is written with square brackets and matches any single one of the characters inside. In this example, it would be:

.*[US].*[US]
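As a quick check, here is a small sketch comparing the grouped and character-class versions with findall on the string from the question. The variable names are illustrative only.

```python
import re

uni = "University of Sheffield"

# Non-capturing groups: findall returns the whole match.
grouped = re.findall(r".*(?:U|S).*(?:U|S)", uni)

# Character class: same match, shorter pattern.
char_class = re.findall(r".*[US].*[US]", uni)

# Capturing groups: findall returns a tuple of the group contents instead.
captured = re.findall(r".*(U|S).*(U|S)", uni)

print(grouped)     # ['University of S']
print(char_class)  # ['University of S']
print(captured)    # [('U', 'S')]
```

The last line is why the non-capturing form or the character class is usually the more convenient choice with findall.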

Upvotes: 1

Mark Tolonen

Reputation: 177901

An OR (|) stops at the first alternative that matches, not the longest. Once a match is made, .findall checks only the remaining string:

In the first case, .*U or S.*U or S is checked:

  • .*U matches U.
  • In the remaining string (niversity of Sheffield), S matches S.
  • In the remaining string (heffield), none of the patterns match.

In the second case, .*U or S.*S or U is checked:

  • .*U matches U.
  • In the remaining string (niversity of Sheffield), none of the patterns match.

In the third case, .*S or U.*S or U is checked:

  • .*S matches University of S.
  • In the remaining string (heffield), none of the patterns match.
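The three walkthroughs above can be reproduced with re.finditer, which exposes where each match ends and therefore what is left to scan. This is a minimal sketch; the helper name trace is made up for illustration.

```python
import re

uni = "University of Sheffield"

def trace(pattern, text):
    """Return (matched text, string remaining after the match) pairs."""
    return [(m.group(), text[m.end():]) for m in re.finditer(pattern, text)]

print(trace(r".*U|S.*U|S", uni))
# [('U', 'niversity of Sheffield'), ('S', 'heffield')]

print(trace(r".*U|S.*S|U", uni))
# [('U', 'niversity of Sheffield')]

print(trace(r".*S|U.*S|U", uni))
# [('University of S', 'heffield')]
```

Each tuple corresponds to one bullet: the match that was found, followed by the part of the string that is scanned next.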

Upvotes: 0
