firelitte

Reputation: 53

List of all element names in HTML document — beautifulsoup

I want to get a list of all the distinct tag names in an HTML document (a list of tag-name strings without repetition). I tried calling soup.find_all() with no arguments, but this gave me every element of the document instead.

Is there a way of doing it?

Upvotes: 1

Views: 1962

Answers (1)

user4396006

Calling find_all() with no arguments returns a list of every element below the one you call it on, and you can iterate over that list. So you can do the following:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""  # an html sample
soup = BeautifulSoup(html_doc, 'html.parser')

document = soup.html.find_all()

el = ['html',]  # we already include the html tag
for n in document:
    if n.name not in el:
        el.append(n.name)

print(el)


The output of the code snippet is:

['html', 'head', 'title', 'body', 'p', 'b', 'a']


Edit

As @PM 2Ring pointed out, if you don't care about the order in which the elements are added (which, as he says, is probably not the case here), you can use a set instead. set is a built-in, so there is nothing to import; set comprehensions need Python 2.7 or later.

from bs4 import BeautifulSoup

...

el = {x.name for x in document} # use a set comprehension to generate it easily
el.add("html")  # only if you need to
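
If you want both no duplicates and document order, one option is dict.fromkeys, which drops repeats while keeping first-seen order. This is only a minimal sketch, not part of the approach above: it assumes Python 3.7+ (where dicts keep insertion order), and the helper name unique_tag_names and the small sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def unique_tag_names(soup):
    """Return each tag name once, in order of first appearance (hypothetical helper)."""
    # find_all(True) matches every tag in the tree, including <html> itself
    return list(dict.fromkeys(tag.name for tag in soup.find_all(True)))


# Small illustrative document (an assumption, not the sample used earlier)
sample = (
    "<html><head><title>T</title></head>"
    "<body><p><b>x</b></p><p><a href='#'>y</a></p></body></html>"
)
soup = BeautifulSoup(sample, "html.parser")
names = unique_tag_names(soup)
print(names)
```

Because find_all is called on soup rather than on soup.html, the html tag is picked up as well and does not need to be added by hand.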

Upvotes: 5
