Reputation: 650
I am trying to replace some elements (class: method) in a long HTML page using .replaceWith. To do that, I iterate over .descendants and check whether each dl element is one I am looking for. But this only works for up to two adjacent elements: every third and later element in a row is "ignored". Running the same code twice results in 4 replaced dl elements in a row, and so on.
for elem in matches:
    for child in elem.descendants:
        if not isinstance(child, NavigableString) and child.dl is not None and 'method' in child.dl.get('class'):
            text = "<p>***removed something here***</p>"
            child.dl.replaceWith(BeautifulSoup(text))
The (very silly) workaround is to find the maximum number of dl elements in a row, divide it by two, and run the loop that many times. I would like a smart (and fast) solution and, even more importantly, to understand what is going wrong here.
EDIT: The HTML page for testing is https://docs.python.org/3/library/stdtypes.html; the error can be seen in section 4.7.1, String Methods (there are a lot of methods there).
EDIT_2: I do not use the whole HTML page, only parts of it. The HTML parts are stored in a list, and I only want dl elements to be "removed" if they are not the first HTML element of a part (so I want to keep the element if it is the head).
Altogether, this is what my code actually looks like:
from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup(open("/home/sven/Bachelorarbeit/python-doc-extractor-for-cado/extractor-application/index.html"))
f = open('test.html', 'w')  # mode 'w' creates the file or truncates an existing one
matches = []
dl_elems = soup.find_all(['dl'], attrs={'class': ['class', 'method', 'function', 'describe', 'classmethod', 'staticmethod']})  # grab all possible dl-elements
sections = soup.find_all(['div'], attrs={'class': 'section'})  # grab all section-elements
matches = dl_elems + sections  # merge the lists to get all results
for elem in matches:
    for child in elem.descendants:
        if not isinstance(child, NavigableString) and child.dl is not None and 'method' in child.dl.get('class'):
            text = "<p>***removed something here***</p>"
            child.dl.replaceWith(BeautifulSoup(text))
print(matches, file=f)
f.close()
Upvotes: 1
Views: 1492
Reputation: 473893
The idea is to find all dl elements that have class="method" and replace each of them with a p tag:
from urllib.request import urlopen
from bs4 import BeautifulSoup, Tag
# get the html
url = "https://docs.python.org/3/library/stdtypes.html"
soup = BeautifulSoup(urlopen(url), "html.parser")
# replace all `dl` elements with `method` class
for elem in soup('dl', class_='method'):
    tag = Tag(name='p')
    tag.string = '***removed something here***'
    elem.replace_with(tag)
print(soup.prettify())
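As for what goes wrong in the original loop: child.dl is shorthand for child.find('dl'), which returns only the first dl descendant of child. Each visited element can therefore replace at most one dl, so the number replaced in a row probably depends on how many enclosing elements are visited, not on how many dl elements there are. A minimal sketch with a made-up HTML snippet (the markup and variable names are assumptions for illustration, not taken from the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three method definitions in a row inside one section.
html = """
<div class="section">
<dl class="method"><dt>a()</dt></dl>
<dl class="method"><dt>b()</dt></dl>
<dl class="method"><dt>c()</dt></dl>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
section = soup.find("div", class_="section")

# Attribute access is shorthand for find(): only the first match comes back.
first_only = section.dl

# find_all() collects every match into a list before the loop starts,
# so replacing elements while looping over that list does not skip any.
all_methods = section.find_all("dl", class_="method")
for dl in all_methods:
    p = soup.new_tag("p")
    p.string = "***removed something here***"
    dl.replace_with(p)

remaining = section.find_all("dl")
replaced = section.find_all("p")
```

Collecting the matches with find_all first also avoids modifying the tree while a .descendants generator is still walking it.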
UPD (adapted to the question edit):
dl_elems = soup.find_all(['dl'], attrs={'class': ['class', 'method', 'function', 'describe', 'classmethod', 'staticmethod']})  # grab all possible dl-elements
sections = soup.find_all(['div'], attrs={'class': 'section'})  # grab all section-elements
for parent in dl_elems + sections:
    for elem in parent.find_all('dl', {'class': 'method'}):
        tag = Tag(name='p')
        tag.string = '***removed something here***'
        elem.replace_with(tag)
print(dl_elems + sections)
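The loop above keeps a dl that is itself the head of a list entry, because find_all searches only below parent and never matches parent itself. A small sketch of that behaviour, assuming a made-up fragment in which the same dl appears both as its own entry and nested in a section (markup and names are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one method definition inside one section.
html = """
<div class="section">
<dl class="method"><dt>top()</dt></dl>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Same list construction as in the question: dl elements first, then sections.
fragments = soup.find_all("dl", class_="method") + soup.find_all("div", class_="section")

for parent in fragments:
    # Only descendants of `parent` are searched, so a head dl is left alone.
    for elem in parent.find_all("dl", class_="method"):
        tag = soup.new_tag("p")
        tag.string = "***removed something here***"
        elem.replace_with(tag)

head_entry = str(fragments[0])     # the dl that is its own list entry
section_entry = str(fragments[1])  # the section that used to contain it
```

Here head_entry still holds the full dl (it was detached from the tree but keeps its content), while section_entry shows the placeholder paragraph instead.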
Upvotes: 1