ChinaMahjongKing

Reputation: 173

Use BeautifulSoup to Find a Custom HTML Tag

I am trying to use BeautifulSoup to locate a Gliffy diagram on an HTML page. The source code of the HTML page looks roughly like the following:

<p>Lorem ipsum dolor sit amet</p>
<p>Figure: Consectetur adipiscing elit</p>
<p>
   <ac:structured-macro ac:macro-id="a9ab423b-b68c-4836-bffa-cdf1c5b95392" ac:name="gliffy" ac:schema-version="1">
      <ac:parameter ac:name="displayName">Sed do eiusmod</ac:parameter>
      <ac:parameter ac:name="name">Tempor incididunt ut</ac:parameter>
      <ac:parameter ac:name="pagePin">2</ac:parameter>
   </ac:structured-macro>
</p>
<p><br/></p>

I want to locate the <ac:structured-macro ac:macro-id="a9ab423b-b68c-4836-bffa-cdf1c5b95392" ac:name="gliffy" ac:schema-version="1"> element in the page, but without a general statement like soup.find_all('ac:structured-macro'). Confluence uses many kinds of macros, so I want to match only the macro with ac:name="gliffy" and exclude all the others.

However, this does not look like a standard HTML tag, so I am not sure BeautifulSoup is the correct choice. Should I be using another library such as lxml? Please let me know which library and function I should use, and how to call it, to accurately locate the Gliffy diagram in this HTML page. Thank you.

Upvotes: 1

Views: 356

Answers (1)

cards

Reputation: 5005

For XML data you can still use BeautifulSoup, but you need to install the lxml parser, which is not part of the standard library.

pip install lxml

Here is an example of how the search code could look:

from bs4 import BeautifulSoup

html = """<p>Lorem ipsum dolor sit amet</p>
<p>Figure: Consectetur adipiscing elit</p>
<p>
    <ac:structured-macro ac:macro-id="a9ab423b-b68c-4836-bffa-cdf1c5b95392" ac:name="gliffy" ac:schema-version="1">
    <ac:parameter ac:name="displayName">Sed do eiusmod</ac:parameter>
    <ac:parameter ac:name="name">Tempor incididunt ut</ac:parameter>
    <ac:parameter ac:name="pagePin">2</ac:parameter>
    </ac:structured-macro>
</p>
<p><br/></p>"""


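# Parse the markup with the lxml backend installed above
soup = BeautifulSoup(html, "lxml")

# Match on the attribute value only, so other macro types are skipped
for tag in soup.find_all(attrs={"ac:name": "gliffy"}):
    print(tag)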
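
If you would rather not install lxml, the same lookup should also work with the "html.parser" backend that ships with Python. Below is a minimal sketch of that, which also restricts the match to the ac:structured-macro tag name and reads the parameters out of the macro. The second macro in the sample (ac:name="code") and the helper name find_gliffy_macros are made up for illustration; they are not from the question.

```python
from bs4 import BeautifulSoup

# Sample markup: the gliffy macro from the question plus a second,
# made-up macro (ac:name="code") that must NOT be matched.
SAMPLE = """<p>Figure: Consectetur adipiscing elit</p>
<p>
  <ac:structured-macro ac:macro-id="a9ab423b-b68c-4836-bffa-cdf1c5b95392" ac:name="gliffy" ac:schema-version="1">
    <ac:parameter ac:name="displayName">Sed do eiusmod</ac:parameter>
    <ac:parameter ac:name="name">Tempor incididunt ut</ac:parameter>
    <ac:parameter ac:name="pagePin">2</ac:parameter>
  </ac:structured-macro>
</p>
<p>
  <ac:structured-macro ac:name="code" ac:schema-version="1">
    <ac:parameter ac:name="language">python</ac:parameter>
  </ac:structured-macro>
</p>"""


def find_gliffy_macros(markup):
    """Return every ac:structured-macro element whose ac:name is "gliffy"."""
    # "html.parser" is the parser from the standard library (hypothetical
    # helper; swap in "lxml" if that is what you already use).
    soup = BeautifulSoup(markup, "html.parser")
    # Filter on both the tag name and the attribute value.
    return soup.find_all("ac:structured-macro", attrs={"ac:name": "gliffy"})


macros = find_gliffy_macros(SAMPLE)

# Collect each macro's parameters as a {name: value} dict.
all_params = [
    {p["ac:name"]: p.get_text(strip=True) for p in macro.find_all("ac:parameter")}
    for macro in macros
]

for params in all_params:
    print(params)
```

Filtering on the tag name as well as the attribute guards against some unrelated element that happens to carry ac:name="gliffy".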

Upvotes: 2
