humanbeing

Reputation: 1697

Get all HTML tags with Beautiful Soup

I am trying to get a list of all HTML tags from Beautiful Soup.

I see find_all, but I have to know the name of the tag before I search.

If there is text like

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""

How would I get a list like

list_of_tags = ["<div>", "<div>", "<div class='magical'>", "<p>"]

I know how to do this with regex, but am trying to learn BS4

Upvotes: 44

Views: 96820

Answers (5)

Amar Kumar

Reputation: 2646

If you want to find some specific HTML tags then try this:

from bs4 import BeautifulSoup

html = driver.page_source  # e.g. from a Selenium WebDriver; here equivalent to the question's HTML snippet
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all(['a', 'div']):  # mention HTML tag names here
    print(tag.text)

# Result:
# something
# something else
# hi there

Upvotes: 3

Jason R Stevens CFA

Reputation: 3091

I thought I'd share my solution to a very similar question for those who find themselves here later.

Example

I needed to find all tags quickly, but only wanted the unique names. I'll use the Python calendar module to demonstrate.

We'll generate an HTML calendar, then parse it and collect only the unique tag names present.

The structure below is very similar to the above, using a set comprehension:

from bs4 import BeautifulSoup
import calendar

html_cal = calendar.HTMLCalendar().formatmonth(2020, 1)
{tag.name for tag in BeautifulSoup(html_cal, 'html.parser').find_all()}

# Result
# {'table', 'td', 'th', 'tr'}
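If you want frequencies rather than just the unique names, the same pattern works with collections.Counter (a small extension of the answer above, not part of it):

```python
import calendar
from collections import Counter

from bs4 import BeautifulSoup

# Generate the same HTML calendar and count each tag name
html_cal = calendar.HTMLCalendar().formatmonth(2020, 1)
soup = BeautifulSoup(html_cal, "html.parser")
print(Counter(tag.name for tag in soup.find_all()))
```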

Upvotes: 4

alecxe

Reputation: 473763

You don't have to specify any arguments to find_all() - in this case, BeautifulSoup would find you every tag in the tree, recursively.

Sample:

from bs4 import BeautifulSoup

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>
"""
soup = BeautifulSoup(html, "html.parser")

print([tag.name for tag in soup.find_all()])
# ['div', 'div', 'div', 'p']

print([str(tag) for tag in soup.find_all()])
# ['<div>something</div>', '<div>something else</div>', '<div class="magical">hi there</div>', '<p>ok</p>']
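If you want exactly the list from the question (just the opening tags, without the contents), you can rebuild them from tag.name and tag.attrs. A sketch, using a hypothetical opening_tag helper; note that attribute quoting may differ from the source, and bs4 parses class as a multi-valued attribute:

```python
from bs4 import BeautifulSoup

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>
"""
soup = BeautifulSoup(html, "html.parser")

def opening_tag(tag):
    # Rebuild the opening tag from its name and attributes;
    # multi-valued attributes (like class) come back as lists.
    attrs = "".join(
        ' {}="{}"'.format(k, " ".join(v) if isinstance(v, list) else v)
        for k, v in tag.attrs.items()
    )
    return "<{}{}>".format(tag.name, attrs)

print([opening_tag(tag) for tag in soup.find_all()])
# ['<div>', '<div>', '<div class="magical">', '<p>']
```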

Upvotes: 72

Belkacem Thiziri

Reputation: 665

Here is a function I use to parse different HTML and plain-text documents:

from bs4 import BeautifulSoup
from tqdm import tqdm


def parse_docs(path, format, tags):
    """
    Parse the files in path, in html or txt format, and extract their text content.
    Returns a list of strings, where every string is one document's text content.
    :param path: str
    :param format: str
    :param tags: list
    :return: list
    """
    docs = []
    if format == "html":
        for document in tqdm(get_list_of_files(path)):  # get_list_of_files: your own file-listing helper
            with open(document, encoding='utf-8') as f:
                soup = BeautifulSoup(f.read(), 'html.parser')
            text = '\n'.join(''.join(s.find_all(text=True))
                             for s in soup.find_all(tags))  # extract the text of the requested tags only
            docs.append(text)
    else:
        for document in tqdm(get_list_of_files(path)):
            with open(document, encoding='utf-8') as f:
                docs.append(f.read())
    return docs

A simple call such as parse_docs('/path/to/folder', 'html', ['p', 'h', 'div']) returns a list of text strings.

Upvotes: 0

Anjan

Reputation: 71

Please try the below:

for tag in soup.findAll(True):
    print(tag.name)
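Spelled out with the question's HTML (find_all is the modern alias of findAll, and True matches every tag), this becomes:

```python
from bs4 import BeautifulSoup

html = """<div>something</div>
<div>something else</div>
<div class='magical'>hi there</div>
<p>ok</p>"""

soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(True):  # True matches every tag in the tree
    print(tag.name)
# div
# div
# div
# p
```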

Upvotes: 7
