Reputation: 929
I want to walk through my HTML source and extract every tag and every piece of text, but each one on its own, without its children.
For example, given this HTML:
<html>
<head>
<title>title</title>
</head>
<body>
Hello world
</body>
</html>
When I called soup.find_all() or iterated over soup.descendants, the result was:
<html><head><title>title</title></head><body>Hello world</body></html>
<head><title>title</title></head>
<title>title</title>
title
<body>Hello world</body>
Hello world
What I'm looking for is every tag separately, without its descendants:
<html>
<head>
<title>
title
<body>
Hello world
How can I do that?
Upvotes: 1
Views: 46
Reputation: 473893
The idea is to iterate over all elements. For those with no child elements, print the name together with the text; for the rest, print only the name:
for elm in soup():  # soup() is equivalent to soup.find_all()
    if not elm():  # elm() is equivalent to elm.find_all()
        print(elm.name, elm.get_text(strip=True))
    else:
        print(elm.name)
Prints:
html
head
title title
body Hello world
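If you want the exact layout from the question (each opening tag on its own line, followed by its text), one possible alternative is to skip the tree and work with parser events instead. The sketch below uses the standard library's html.parser rather than BeautifulSoup; the names FlatNodeCollector and flatten are made up for this illustration, and it assumes attributes can be dropped and whitespace-only text ignored.

```python
from html.parser import HTMLParser


class FlatNodeCollector(HTMLParser):
    """Hypothetical helper: collects opening tags and non-empty text in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        # Record only the opening tag itself; attributes and children are ignored.
        self.nodes.append(f"<{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.nodes.append(text)


def flatten(html):
    """Return tags and text as a flat list, one entry per node."""
    collector = FlatNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


html = """<html>
<head>
<title>title</title>
</head>
<body>
Hello world
</body>
</html>"""

for node in flatten(html):
    print(node)
```

Because the parser reports start tags and text as separate events, no node ever carries its descendants along, which is the separation the question asks for.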
Upvotes: 2