Dan

Reputation: 929

How to get all the tags in HTML code with BeautifulSoup, without their children?

I want to walk through my HTML source and extract every tag and every piece of text in it, but without each tag's children.

For example, given this HTML:

<html>
<head>
<title>title</title>
</head>
<body>
Hello world
</body>
</html>

When I tried to call soup.find_all() or soup.descendants, my return value was:

<html><head><title>title</title></head><body>Hello world</body></html>
<head><title>title</title></head>
<title>title</title>
title
<body>Hello world</body>
Hello world

What I'm looking for is every tag on its own, separated from its descendants:

<html>
<head>
<title>
title
<body>
Hello world

How can I do that?

Upvotes: 1

Views: 46

Answers (1)

alecxe

Reputation: 473893

The idea would be to iterate over all tags. For those with no child elements, also get the text:

for elm in soup():  # soup() is equivalent to soup.find_all()
    if not elm():  # elm() is equivalent to elm.find_all()
        print(elm.name, elm.get_text(strip=True))
    else:
        print(elm.name)

Prints:

html
head
title title
body Hello world
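
If the goal is the exact line-by-line output from the question (each opening tag on its own line, then its text), one possible variation is to walk soup.descendants and print only the tag name or the stripped text of each node. This is a minimal sketch rather than the only way to do it: the flat_nodes helper name is made up for illustration, and it assumes the built-in "html.parser" backend (other parsers may add or rearrange tags).

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = """<html>
<head>
<title>title</title>
</head>
<body>
Hello world
</body>
</html>"""


def flat_nodes(markup):
    """Return each tag as '<name>' and each non-empty text node, in document order.

    Hypothetical helper for illustration; assumes the "html.parser" backend.
    """
    soup = BeautifulSoup(markup, "html.parser")
    out = []
    for node in soup.descendants:
        if isinstance(node, Tag):
            # Emit only the tag name, never its rendered children.
            out.append(f"<{node.name}>")
        elif isinstance(node, NavigableString):
            # Skip whitespace-only strings such as the newlines between tags.
            text = node.strip()
            if text:
                out.append(text)
    return out


print("\n".join(flat_nodes(html)))
```

With the sample HTML from the question this should print <html>, <head>, <title>, title, <body> and Hello world, one per line. Attributes are dropped here; if they matter, node.attrs could be formatted into the tag string as well.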

Upvotes: 2
