Pattern Matching Tags with RegEx and Python (re.findall)

Question

I need to match and capture the information between the pairs of tags. There are 2 pairs of tags per line. A pair of tags is like this:

  hello hello 123 stuff to ignore here 123412bhje what??? stuff to ignore here asd13asf who! Hooooo! stuff to ignore here df7887a

The expected output is:

hello hello 123 123412bhje 
what??? asd13asf 
who! Hooooo! df7887a

I need to specifically use the format:

M = re.findall("", linein)

ScottC · Accepted Answer

To ignore the first tag, the regex assumes that the first character inside a tag is not a space; spaces are allowed after that first character.

Here are the other assumptions made:

Tag names contain only lowercase letters (the pattern matches them with <[a-z]+>).
The information between a tag pair can contain only uppercase letters, lowercase letters, digits, whitespace, and the symbols ! and ?. If any other symbol appears inside the tags, the match may be cut short or missed.
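As a rough illustration of these assumptions, here is a small sketch. The tag name `<t>` and the three test strings are made up for this example and are not from the question. It shows that a capture stops at the first character outside the allowed set, and that a tag whose content starts with a space is skipped.

```python
import re

# An opening lowercase tag, then a captured run whose first
# character must not be whitespace.
PATTERN = r'<[a-z]+>([A-Za-z0-9?!][A-Za-z0-9?!\s]*)'

# "<t>" is a hypothetical tag name used only for illustration.
clean = re.findall(PATTERN, '<t>who! Hooooo!</t>')
truncated = re.findall(PATTERN, '<t>who, Hooooo!</t>')   # comma is not allowed
leading_space = re.findall(PATTERN, '<t> hello</t>')     # starts with a space

print(clean)          # ['who! Hooooo!']
print(truncated)      # ['who']
print(leading_space)  # []
```

The closing tag is never matched because `/` is not in `[a-z]`.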

Here is a working version based on your example:

import re

linein = '  hello hello 123 stuff to ignore here 123412bhje what??? stuff to ignore here asd13asf who! Hooooo! stuff to ignore here df7887a'
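# Capture the text after an opening lowercase tag; the first captured
# character must not be whitespace.
M = re.findall(r'<[a-z]+>([A-Za-z0-9?!][A-Za-z0-9?!\s]*)', linein)

# There are two tag pairs per output line, so print the matches two at a time.
for i in range(0, len(M), 2):
    print(M[i], M[i+1])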

OUTPUT:

hello hello 123 123412bhje
what??? asd13asf
who! Hooooo! df7887a
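
The pairing step can also be sketched with zip instead of index arithmetic. This is only a minimal sketch: the input line and the tag names `<a>` and `<b>` are hypothetical, chosen to satisfy the assumptions listed in the answer.

```python
import re

PATTERN = r'<[a-z]+>([A-Za-z0-9?!][A-Za-z0-9?!\s]*)'

# Hypothetical input: the tag names <a> and <b> are assumptions for illustration.
linein = ('<a>hello hello 123</a> stuff to ignore here <b>123412bhje</b> '
          '<a>what???</a> stuff to ignore here <b>asd13asf</b>')

M = re.findall(PATTERN, linein)

# Pair consecutive matches: (M[0], M[1]), (M[2], M[3]), ...
pairs = list(zip(M[0::2], M[1::2]))

for first, second in pairs:
    print(first, second)
```

Unlike indexing with M[i+1], zip simply drops an unpaired final match instead of raising IndexError when the number of matches is odd.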
