List of all element names in HTML document — beautifulsoup

Question

I want to get a list of all the distinct tag names in an HTML document (a list of tag-name strings without repetition). I tried calling soup.find_all() with no arguments, but this gave me the entire document instead.

Is there a way of doing it?

user4396006 · Accepted Answer

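Calling find_all() with no arguments gives you a list of every element, which you can iterate over. The first element is the html tag itself, which is why printing the result looks like the entire document. To collect just the names you can do the following: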

from bs4 import BeautifulSoup

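html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""  # an html sample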
soup = BeautifulSoup(html_doc, 'html.parser')

document = soup.html.find_all()

el = ['html']  # soup.html.find_all() returns only descendants, so add the html tag ourselves
for n in document:
    if n.name not in el:
        el.append(n.name)

print(el)

The output of the code snippet would be:

['html', 'head', 'title', 'body', 'p', 'b', 'a']
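
If you want to keep the order of first appearance without writing the loop by hand, something like the following should also work. This is only a sketch: unique_tag_names and the sample markup are names I made up for illustration, and it assumes Python 3.7+ (where dicts keep insertion order). It calls find_all() on the soup itself rather than on soup.html, so the html tag is included without adding it manually.

```python
from bs4 import BeautifulSoup


def unique_tag_names(markup):
    """Return each tag name once, in order of first appearance (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    # find_all() with no arguments matches every tag, including <html>.
    # dict.fromkeys() drops duplicates while keeping insertion order (Python 3.7+).
    return list(dict.fromkeys(tag.name for tag in soup.find_all()))


# Small made-up document, only for demonstration.
sample = (
    "<html><head><title>T</title></head>"
    "<body><p><b>x</b></p><p><a href='#'>y</a></p></body></html>"
)
print(unique_tag_names(sample))
```

For this sample it prints ['html', 'head', 'title', 'body', 'p', 'b', 'a'].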

Edit

As @PM 2Ring pointed out, if you don't care about the order in which the element names are added (and, as he says, you probably don't), you can use a set instead. set is a built-in type, so there is nothing to import; set comprehensions need Python 2.7 or later, so check your version if you are on something older.

from bs4 import BeautifulSoup

...

el = {x.name for x in document}  # use a set comprehension to generate it easily
el.add("html")  # only if you need to
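
For completeness, here is a self-contained version of the set approach, since the snippet above leaves out the parsing step. Treat it as a sketch: tag_name_set and the sample markup are made-up names, and I sort the result at the end only so the output is reproducible (sets have no defined order).

```python
from bs4 import BeautifulSoup


def tag_name_set(markup):
    """Return the set of distinct tag names in the markup (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    # Calling find_all() on the soup itself also yields the <html> tag,
    # so there is no need to add it afterwards.
    return {tag.name for tag in soup.find_all()}


# Small made-up document, only for demonstration.
sample = "<html><body><div><p>one</p><p>two</p><span>three</span></div></body></html>"
names = tag_name_set(sample)
print(sorted(names))  # sorted() turns the set into an alphabetical list
```

For this sample it prints ['body', 'div', 'html', 'p', 'span'].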
