Reputation: 129
Take some rudimentary HTML like this as an example. How could one remove all child nodes nested more than, say, 2 levels deep, so that the tree is truncated at that depth?
<html>
  <head>
    <title></title>
    <meta />
    <meta />
    <link />
  </head>
  <body>
    <div>
      <div>
        <a></a>
        <a></a>
        <a></a>
      </div>
      <span>
        <h1>
          <li></li>
          <li></li>
        </h1>
      </span>
    </div>
  </body>
</html>
would become something like:
<html>
  <head>
    <title></title>
    <meta />
    <meta />
    <link />
  </head>
  <body>
    <div>
      <div></div>
      <span></span>
    </div>
  </body>
</html>
Upvotes: 2
Views: 3285
Reputation: 474071
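The idea is to iterate over all elements recursively and count each element's parents; any tag with exactly the target number of ancestors is removed together with everything inside it: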
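from bs4 import BeautifulSoup

data = """your html goes here"""
# Ancestor count at which tags are removed. The BeautifulSoup object itself
# counts as a parent, so <html> is at 1, <body> at 2, and the <a>/<h1> tags at 5.
depth = 5

soup = BeautifulSoup(data, "html.parser")
for tag in soup.find_all():
    if len(list(tag.parents)) == depth:
        tag.extract()  # detach the tag and everything inside it

print(soup.prettify())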
prints:
<html>
<head>
<title>
</title>
<meta/>
<meta/>
<link/>
</head>
<body>
<div>
<div></div>
<span></span>
</div>
</body>
</html>
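A minimal sketch of the same parent-counting idea wrapped in a reusable function. The name remove_at_depth and the compact sample markup are assumptions made for illustration, and html.parser is chosen explicitly so the ancestor counts do not depend on which parser happens to be installed:

```python
from bs4 import BeautifulSoup


def remove_at_depth(soup, depth):
    """Detach every tag that has exactly `depth` ancestors.

    The BeautifulSoup object itself counts as one ancestor.
    (Hypothetical helper, not part of Beautiful Soup.)
    """
    # find_all(True) returns a list of all tags, so extracting while looping is safe.
    for tag in soup.find_all(True):
        if len(list(tag.parents)) == depth:
            tag.extract()
    return soup


html = (
    "<html><body><div>"
    "<div><a></a><a></a></div>"
    "<span><h1><li></li></h1></span>"
    "</div></body></html>"
)
soup = remove_at_depth(BeautifulSoup(html, "html.parser"), 5)
result = str(soup)
print(result)
```

Because the count starts at the document root, the right value for depth changes if the markup is a fragment without html and body tags.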
Upvotes: 2
Reputation: 675
Maybe something like:
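# assumes soup is an already-parsed BeautifulSoup document
for child in soup.body.find_all(recursive=False):  # direct child tags only, no text nodes
    for element in child.find_all(recursive=False):
        element.clear()  # empty the tag but keep the tag itself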
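The nested loops hard-code two levels below body. A hedged sketch of how the same idea might generalize to any number of levels with a small recursive helper (empty_below is a hypothetical name, not part of Beautiful Soup, and the sample markup is shortened for illustration):

```python
from bs4 import BeautifulSoup


def empty_below(node, levels):
    """Clear the contents of every tag sitting `levels` steps below `node`."""
    if levels == 0:
        node.clear()  # drop all children but keep the tag itself
        return
    # recursive=False restricts the search to direct child tags, skipping text nodes
    for child in node.find_all(recursive=False):
        empty_below(child, levels - 1)


html = "<body><div><div><a></a></div><span><h1><li></li></h1></span></div></body>"
soup = BeautifulSoup(html, "html.parser")
empty_below(soup.body, 2)
pruned = str(soup)
print(pruned)
```

Counting from a chosen node rather than from the document root means the levels argument does not depend on how many wrapper tags the parser adds.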
Upvotes: 1