RedFox

Reputation: 141

Get all tags of an HTML file with bs4

I want to be able to get all the tags of an HTML file, say:

<html>
<body>
<something>
</something>
</body>
</html>

I want this to return something like ['html', 'body', 'something']. While bs4 can find all instances of a given tag, I have yet to find anything that returns every tag in the document. This is the code I wrote to print a clean version of the file:

from bs4 import BeautifulSoup

with open('nameofhtm.html') as f:
    soup = BeautifulSoup(f, 'lxml')
    print(soup.prettify())

Output:

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   nothing
  </title>
  <link href="None" rel="shortcut icon"/>
  <link href="style.css" rel="stylesheet"/>
  <header>
   nothing more
  </header>
  <something>
  </something>
 </head>
</html>

Is there a way to do this? Thanks in advance.

Upvotes: 0

Views: 536

Answers (1)

joni

Reputation: 7157

You could use a filter function and extract all the tag names:

from bs4 import BeautifulSoup

soup = BeautifulSoup(your_html, 'html.parser')  # name the parser explicitly to avoid a warning
tag_names = [tag.name for tag in soup.find_all(lambda tag: tag is not None)]

You could just as well use soup.find_all(name=True), which matches every tag regardless of its name, i.e.

soup = BeautifulSoup(your_html, 'html.parser')
tag_names = [tag.name for tag in soup.find_all(name=True)]

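which is equivalent to the filter function. Both versions return one name per tag occurrence, in document order, so a tag that appears twice is listed twice.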
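If the goal is a list like the one in the question, with each tag name appearing only once, the result can be deduplicated. The following is a minimal sketch, not part of the original answer: the helper name unique_tag_names is hypothetical, and it assumes the built-in 'html.parser' rather than the 'lxml' parser used in the question, so the tree it builds may differ slightly for malformed input.

```python
from bs4 import BeautifulSoup


def unique_tag_names(markup, parser="html.parser"):
    """Return each distinct tag name once, in order of first appearance.

    `unique_tag_names` is a hypothetical helper; the default parser is an assumption.
    """
    soup = BeautifulSoup(markup, parser)
    # find_all(True) matches every tag; dict.fromkeys drops repeats
    # while preserving insertion order.
    return list(dict.fromkeys(tag.name for tag in soup.find_all(True)))


sample = """<html>
<body>
<something>
</something>
</body>
</html>"""

print(unique_tag_names(sample))  # ['html', 'body', 'something']
```

dict.fromkeys is used instead of set() here because a set would lose the document order of the tags.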

Upvotes: 2
