Reputation: 173
I am trying to use BeautifulSoup to locate a Gliffy diagram on an HTML page. The source code of the HTML page looks roughly like the following:
<p>Lorem ipsum dolor sit amet</p>
<p>Figure: Consectetur adipiscing elit</p>
<p>
<ac:structured-macro ac:macro-id="a9ab423b-b68c-4836-bffa-cdf1c5b95392" ac:name="gliffy" ac:schema-version="1">
<ac:parameter ac:name="displayName">Sed do eiusmod</ac:parameter>
<ac:parameter ac:name="name">Tempor incididunt ut</ac:parameter>
<ac:parameter ac:name="pagePin">2</ac:parameter>
</ac:structured-macro>
</p>
<p><br/></p>
I want to locate the <ac:structured-macro ac:macro-id="a9ab423b-b68c-4836-bffa-cdf1c5b95392" ac:name="gliffy" ac:schema-version="1">
element in the page, but without a statement as general as soup.find_all('ac:structured-macro').
Confluence uses many kinds of macros, so I want to match only the ac:name="gliffy"
macro and exclude all others.
However, this does not look like a standard HTML tag, so I am not sure BeautifulSoup is the right choice. Should I be using another library, such as lxml? Please let me know which library and function I should use, and how to call it, to accurately locate the Gliffy diagram in this HTML page. Thank you.
Upvotes: 1
Views: 356
Reputation: 5005
For XML
data you can still use BeautifulSoup,
but you need to install the lxml
parser, which is not part of the standard library.
pip install lxml
Here is an example of what the lookup code could look like:
from bs4 import BeautifulSoup
html = """<p>Lorem ipsum dolor sit amet</p>
<p>Figure: Consectetur adipiscing elit</p>
<p>
<ac:structured-macro ac:macro-id="a9ab423b-b68c-4836-bffa-cdf1c5b95392" ac:name="gliffy" ac:schema-version="1">
<ac:parameter ac:name="displayName">Sed do eiusmod</ac:parameter>
<ac:parameter ac:name="name">Tempor incididunt ut</ac:parameter>
<ac:parameter ac:name="pagePin">2</ac:parameter>
</ac:structured-macro>
</p>
<p><br/></p>"""
soup = BeautifulSoup(html, "lxml")
# Match any tag whose ac:name attribute is exactly "gliffy"
for tag in soup.find_all(attrs={"ac:name": "gliffy"}):
    print(tag)
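If you want to be stricter, you can probably combine the tag name with the attribute filter, so that only ac:structured-macro elements are considered, and then read the parameters from each match. The following is a minimal sketch, not a definitive solution. It assumes Python's built-in "html.parser" (so lxml is not required for this variant), the second "toc" macro is only there to illustrate the exclusion, and macro_parameters is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, plus one extra macro ("toc") added
# purely for illustration, to show that other macros are excluded.
html = """<p>
<ac:structured-macro ac:macro-id="a9ab423b-b68c-4836-bffa-cdf1c5b95392" ac:name="gliffy" ac:schema-version="1">
<ac:parameter ac:name="displayName">Sed do eiusmod</ac:parameter>
<ac:parameter ac:name="name">Tempor incididunt ut</ac:parameter>
<ac:parameter ac:name="pagePin">2</ac:parameter>
</ac:structured-macro>
</p>
<p><ac:structured-macro ac:name="toc" ac:schema-version="1"></ac:structured-macro></p>"""

# Assumption: the built-in parser is used here instead of lxml.
soup = BeautifulSoup(html, "html.parser")

# Restrict the search by tag name AND by the ac:name attribute value.
gliffy_macros = soup.find_all("ac:structured-macro", attrs={"ac:name": "gliffy"})


def macro_parameters(macro):
    """Hypothetical helper: map each ac:parameter name to its text."""
    return {
        param["ac:name"]: param.get_text(strip=True)
        for param in macro.find_all("ac:parameter")
    }


for macro in gliffy_macros:
    print(macro["ac:macro-id"], macro_parameters(macro))
```

Passing the tag name as the first argument means that an unrelated element which happens to carry ac:name="gliffy" would not be matched.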
Upvotes: 2