Reputation: 73
I have a few HTML files that contain one of two different patterns of a snippet, in which only name="horizon"
is constant. I need to get the values of the attribute named "value". Sample files are below:
File1:
<tag1> data
</tag1>
<select size="1" name="horizon">
<option value="Admin">Admin Users</option>
<option value="Remote Admin">Remote Admin</option>
</select>
File2:
<othertag some_att="asfa"> data
</othertag>
<select id="realm_17" size="1" name="horizon">
<option id="option_LoginPage_1" value="Admin Users">Admin Users</option>
<option id="option_LoginPage_1" value="Global-User">Global-User</option>
</select>
Since the files contain other tags and attributes, I tried writing a regular expression, based on this, to extract the required content from the files:
regex='^(?:.*?)(<(?P<TAG>\w+).+name\=\"horizon\"(?:.*[\n|\r\n?]*)+?<\/(?P=TAG>)'
I have tried this with re.MULTILINE and re.DOTALL, but could not get the desired text.
I suppose I would then be able to get the required values as a list by using re.findall('value\=\"(.*)\"', text) once I have the required text.
Please suggest whether there is a more elegant way to handle this situation.
Upvotes: 0
Views: 554
Reputation: 73
I tried the xml.etree.ElementTree module as explained by @kazbeel, but it raised a "mismatched tag" error, which I found to be common when it is used on HTML. I then found the BeautifulSoup module, and it gave the desired results. The following code also covers a third file pattern in addition to the two from the question.
File3:
<input id="realm_90" type="hidden" name="horizon" value="RADIUS">
Code:
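from bs4 import BeautifulSoup  # third-party module for parsing XML/HTML files

def get_realms(html_text):
    realms = []
    soup = BeautifulSoup(html_text, 'lxml')
    # Locate the first element whose name attribute is "horizon"
    in_tag = soup.find(attrs={"name": "horizon"})
    if in_tag is None:
        return realms  # no matching element in this file
    if in_tag.name == 'select':
        # Collect the value of every <option> inside the <select>
        for tag in in_tag.find_all('option'):
            realms.append(tag.attrs['value'])
    elif in_tag.name == 'input':
        realms.append(in_tag.attrs['value'])
    return realms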
I agree with @ZiTAL that regular expressions should not be used to parse XML/HTML files: it gets too complicated, and there are a number of libraries available for the job.
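If installing BeautifulSoup and lxml is not an option, roughly the same logic can probably be sketched with the standard library's html.parser, which also tolerates the unclosed <input> tag. The names HorizonParser and get_realms_stdlib are made up for this illustration, and the sketch assumes the markup looks like the three sample files above.

```python
from html.parser import HTMLParser


class HorizonParser(HTMLParser):
    """Collects 'value' attributes belonging to the element named 'horizon'."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._in_select = False  # True while inside <select name="horizon">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("name") == "horizon":
            self._in_select = True
        elif tag == "option" and self._in_select:
            # Pattern from File1/File2: options nested in the select
            if attrs.get("value") is not None:
                self.values.append(attrs["value"])
        elif tag == "input" and attrs.get("name") == "horizon":
            # Pattern from File3: a single hidden input
            if attrs.get("value") is not None:
                self.values.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._in_select = False


def get_realms_stdlib(html_text):
    """Hypothetical stdlib-only counterpart of get_realms."""
    parser = HorizonParser()
    parser.feed(html_text)
    parser.close()
    return parser.values
```

Unlike get_realms, this sketch collects values from every matching element in the text rather than only the first one, which may or may not be what you want.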
Upvotes: 0
Reputation: 2029
Try this regex:
value="(.*)">
This is a simple regex for extracting the value from your HTML files. It captures everything between the double quotes that follows value= and precedes ">".
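A quick way to check the pattern is to run it with re.findall on text shaped like the sample files. The sketch below does that and also tries a made-up line (the variable tricky, not taken from the question) where value is not the last attribute; the alternative pattern using [^"]* is only a suggestion. Note that neither pattern restricts matches to the element with name="horizon".

```python
import re

# Text modelled on File1 and File3 from the question
sample = '''<select size="1" name="horizon">
<option value="Admin">Admin Users</option>
<option value="Remote Admin">Remote Admin</option>
</select>
<input id="realm_90" type="hidden" name="horizon" value="RADIUS">'''

# The pattern from this answer: works when value is the last attribute
greedy = re.findall(r'value="(.*)">', sample)

# Hypothetical line where another attribute follows value
tricky = '<option value="Guest" id="opt_3">Guest</option>'
greedy_tricky = re.findall(r'value="(.*)">', tricky)

# Suggested alternative: stop at the next double quote
safer_tricky = re.findall(r'value="([^"]*)"', tricky)

print(greedy)         # ['Admin', 'Remote Admin', 'RADIUS']
print(greedy_tricky)  # ['Guest" id="opt_3']
print(safer_tricky)   # ['Guest']
```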
Upvotes: 0
Reputation: 1436
I completely agree with @ZiTAL that parsing the files as XML would be much faster and nicer.
A few simple lines of code would solve your problem:
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
# If you prefer to parse the text directly do root = ET.fromstring('<root>example</root>')
values = [el.attrib['value'] for el in root.findall('.//option')]
print(values)
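As a hedged illustration of the fromstring variant: the sample files have several top-level elements, so the sketch below wraps the text in an artificial <root> element (an assumption of this example, not something the files contain) and narrows the search to the select named "horizon". It also shows that an unclosed tag, such as a bare <input>, makes the XML parser raise ParseError.

```python
import xml.etree.ElementTree as ET

# Text modelled on File1 from the question
fragment = '''<tag1> data
</tag1>
<select size="1" name="horizon">
<option value="Admin">Admin Users</option>
<option value="Remote Admin">Remote Admin</option>
</select>'''

# XML needs a single root element, so wrap the fragment in one
root = ET.fromstring("<root>" + fragment + "</root>")

# Only look at options inside the select whose name is "horizon"
select = root.find(".//select[@name='horizon']")
values = [opt.get("value") for opt in select.findall("option")]
print(values)  # ['Admin', 'Remote Admin']

# An unclosed <input> is valid HTML but not well-formed XML
parse_failed = False
try:
    ET.fromstring('<root><input name="horizon" value="RADIUS"></root>')
except ET.ParseError:
    parse_failed = True
```

This only works when the HTML happens to be well-formed XML; for anything looser, an HTML parser is the safer choice.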
Upvotes: 2