Jeeta

Reputation: 73

Regular Expressions match a block from multiline html text

I have a few HTML files that contain one of two variants of a block in which only name="horizon" is constant. I need to extract the values of the attribute named "value". Sample files:
File1:

<tag1> data
</tag1>
<select size="1" name="horizon">
    <option value="Admin">Admin Users</option>
    <option value="Remote Admin">Remote Admin</option>
</select>

File2:

<othertag some_att="asfa"> data
</othertag>
<select id="realm_17" size="1" name="horizon">
    <option id="option_LoginPage_1" value="Admin Users">Admin Users</option>
    <option id="option_LoginPage_1" value="Global-User">Global-User</option>
</select>

Since the files contain other tags and attributes, I tried to filter the required block out of the files with the following regular expression:

regex='^(?:.*?)(<(?P<TAG>\w+).+name\=\"horizon\"(?:.*[\n|\r\n?]*)+?<\/(?P=TAG>)'

I have tried this with re.MULTILINE and re.DOTALL but could not get the desired text.
I suppose that, once I have the required block, I could get the values as a list with re.findall('value\=\"(.*)\"', text).
Please suggest a more elegant way to handle this.

Upvotes: 0

Views: 554

Answers (3)

Jeeta

Reputation: 73

I tried the xml.etree.ElementTree module as explained by @kazbeel, but it gave me a "mismatched tag" error, which, as I found, is a common problem when using it. Then I found the BeautifulSoup module, and it gave the desired results. The following code also covers a third file pattern in addition to the two from the question.
File3:

<input id="realm_90" type="hidden" name="horizon" value="RADIUS">

Code:

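from bs4 import BeautifulSoup  # third-party module for parsing XML/HTML files

def get_realms(html_text):
    """Return the "value" attributes that belong to the element named "horizon"."""
    realms = []
    soup = BeautifulSoup(html_text, 'lxml')  # the 'lxml' parser needs the lxml package
    in_tag = soup.find(attrs={"name": "horizon"})
    if in_tag is None:  # no element with name="horizon" in this file
        return realms
    if in_tag.name == 'select':
        # File1 and File2: one value per <option> inside the <select>
        for tag in in_tag.find_all('option'):
            realms.append(tag.attrs['value'])
    elif in_tag.name == 'input':
        # File3: a single hidden <input> carries the value
        realms.append(in_tag.attrs['value'])
    return realms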

I agree with @ZiTAL that regular expressions should not be used for parsing XML/HTML files: it gets too complicated, and there are a number of libraries available for the job.
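If installing bs4 and lxml is not an option, roughly the same result can probably be had with the standard library's html.parser, which is also tolerant of unclosed tags. The following is only a minimal sketch under that assumption; the names HorizonValues and get_realms_stdlib are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class HorizonValues(HTMLParser):
    """Collect "value" attributes that belong to the element named "horizon"."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._in_select = False  # True while inside <select name="horizon">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("name") == "horizon":
            if tag == "select":
                self._in_select = True
            elif tag == "input" and attrs.get("value") is not None:
                # Single hidden <input> pattern (File3)
                self.values.append(attrs["value"])
        elif tag == "option" and self._in_select:
            # <option> inside the horizon <select> (File1 and File2)
            if attrs.get("value") is not None:
                self.values.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._in_select = False


def get_realms_stdlib(html_text):
    """Hypothetical helper: return the horizon values found in html_text."""
    parser = HorizonValues()
    parser.feed(html_text)
    parser.close()
    return parser.values
```

Calling get_realms_stdlib() on the text of any of the three sample files should return the same list as the BeautifulSoup version.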

Upvotes: 0

Usman

Reputation: 2029

Try this regex:

value="(.*)">

This is a simple regex for extracting the values from your HTML files. It captures everything between the double quotes that follow value= and precede the closing ">".

I have also attached a screenshot of the output.

[Screenshot: output]
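For reference, here is a small sketch of how such a pattern could be applied with re.findall. It uses a non-greedy variant that stops at the first closing quote, which is an assumption on top of the regex above, and the helper name extract_values is hypothetical. Note that it matches every value attribute in the text, not only those inside the name="horizon" block.

```python
import re


def extract_values(text):
    """Hypothetical helper: return every value="..." attribute found in text."""
    # Non-greedy (.*?) stops at the first closing double quote
    return re.findall(r'value="(.*?)"', text)


sample = '''<select size="1" name="horizon">
    <option value="Admin">Admin Users</option>
    <option value="Remote Admin">Remote Admin</option>
</select>'''

print(extract_values(sample))  # ['Admin', 'Remote Admin']
```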

Upvotes: 0

kazbeel

Reputation: 1436

I completely agree with @ZiTAL that parsing the files as XML would be much faster and nicer.

A few simple lines of code would solve your problem:

import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()

# To parse a string instead of a file, use e.g. root = ET.fromstring('<root>example</root>')

values = [el.attrib['value'] for el in root.findall('.//option')]

print(values)
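Keep in mind that ElementTree only accepts well-formed XML with a single root element, while the sample files have several top-level tags. A possible workaround, sketched here under the assumption that every tag in the fragment is properly closed, is to wrap the text in a synthetic root and restrict the search to the name="horizon" select. The function name horizon_options is hypothetical.

```python
import xml.etree.ElementTree as ET


def horizon_options(fragment):
    """Hypothetical helper: option values of <select name="horizon"> in fragment."""
    # Wrap in a synthetic <root> so several top-level elements are allowed
    root = ET.fromstring("<root>" + fragment + "</root>")
    options = root.findall(".//select[@name='horizon']/option")
    return [el.get("value") for el in options if el.get("value") is not None]
```

An unclosed tag, for example an HTML-style input element without a closing slash, still raises ET.ParseError, so this approach only suits files that are valid XML apart from the missing root.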

Upvotes: 2
