guskenny83

Reputation: 1373

regex in python translation

I'm just getting to grips with how regex works in Python, but some of the syntax is throwing me a bit.

How would you translate the following regex into one that can be used by the re module in Python?

a(b|c)*a

It doesn't matter what the symbols are; I am asking more about the brackets and operators and how they work.

To be specific about my situation, I am trying to capture all text between two angle brackets. According to some resources I have read, the "." character matches any character except newline, and "s" matches any whitespace, including newline, so I thought the way to do it would be:

<[.|s]*>

but evidently I was wrong.

I am interested in a solution for my specific problem, but any general information on the operators in Python regex would also be appreciated.

EDIT:

After more experimenting, it seems to work when I use:

<.*>

when I have text like

<foo bar>

but not when I have

<foo
bar>

However, when I try

<[\n.]*>

nothing works. So I thought it might be the brackets causing it, and I tried:

<[.]*>

and that didn't even work like <.*>, but surely the two are the same except for the brackets.

Does anyone have any ideas? I'd like to be able to capture all text like:

<foo
bar>

Upvotes: 0

Views: 691

Answers (2)

rob_wheeler

Reputation: 66

The Python regular expression syntax is documented here:

https://docs.python.org/2/library/re.html

For your particular case, I'd try something like:

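import re

# Capture everything between '<' and the next '>'.
# A raw string keeps any backslashes in a pattern from being treated as string escapes.
pat = re.compile(r'<([^>]*)>')
match = pat.search('Foo <bar> bam')
print(match.groups())
# should print ('bar',)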

To understand the regular expression, we can break it down into its component parts:

  • < - match the left-angle bracket
  • ( - start of a group
  • [^>]* - match 0 or more characters (*) in the class ([^>]). A character class ([]) that starts with a caret (^) means match characters that are not part of the class. In this case the class consists of the single character, right-angle bracket (>).
  • ) - end the group
  • > - match the right-angle bracket
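The breakdown above can be sketched against the multi-line case from the question. A negated class such as [^>] also matches newline characters, so the same pattern should handle text that spans lines. The sample strings and variable names below are made up for illustration.

```python
import re

# Same pattern as in the answer: capture everything between '<' and '>'.
pat = re.compile(r'<([^>]*)>')

single_line = 'Foo <bar> bam'
multi_line = 'x <foo\nbar> y <baz> z'  # illustrative input, not from the question

# search() returns the first match; group(1) is the captured text.
first = pat.search(single_line).group(1)

# findall() returns the captured group for every non-overlapping match.
# [^>] matches '\n' too, so the bracketed text may span lines.
all_matches = pat.findall(multi_line)

print(repr(first))
print(all_matches)
```

Because [^>]* stops at the first >, each match ends at its own closing bracket instead of running on to the last > in the text.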

Upvotes: 3

user3489112

Reputation: 91

a(b|c)*a is directly usable as a Python regular expression. <[.|s]*> does not do what you intend. [...] is a character class, not a group: inside it, | is a literal character rather than alternation, and . is a literal dot rather than a wildcard, which is also why <[.]*> does not behave like <.*>. s does not denote whitespace in Python regular expressions; \s does. Maybe you are confusing |s with \s here, although it would make more sense to use just \n, or to use the re.DOTALL flag so that . also matches a newline.
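A rough sketch of these points, using only the standard re module. The sample inputs and variable names are assumptions chosen for illustration, and re.fullmatch needs Python 3.4 or later.

```python
import re

text = '<foo\nbar>'  # the multi-line case from the question

# a(b|c)*a works unchanged: 'a', any run of 'b' or 'c', then 'a'.
abc_ok = re.fullmatch(r'a(b|c)*a', 'abcba') is not None

# Inside [...], '.' is a literal dot, so [.]* only matches runs of dots.
literal_dot = re.search(r'<[.]*>', '<foo bar>')
dots_only = re.search(r'<[.]*>', '<...>')

# Without a flag, '.' does not match a newline, so this finds nothing.
no_flag = re.search(r'<.*>', text)

# re.DOTALL makes '.' match newlines as well.
with_flag = re.search(r'<.*>', text, re.DOTALL)

# With several tags in one string, the non-greedy '.*?' stops at the
# first '>' instead of the last one.
tags = re.findall(r'<(.*?)>', 'a <foo\nbar> b <baz>', re.DOTALL)

print(abc_ok, literal_dot, dots_only.group(0))
print(no_flag, repr(with_flag.group(0)), tags)
```

The non-greedy form is worth considering whenever the input may contain more than one pair of brackets, since a greedy .* with re.DOTALL would run from the first < to the last >.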

Upvotes: 0
