Reputation: 109
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import re
doc = "<div>Text text <span class='_ _3'>empty</span> text</div>"
soup = BeautifulSoup(doc)
for span in soup.find_all('span', class_=re.compile("_\s_[0-9]+")):
    span.decompose()
I need to find all tags of the form <span class="_ _\d+">
and remove them from the DOM, but this code does not work for some reason.
Upvotes: 2
Views: 139
Reputation: 1122232
BeautifulSoup splits the class attribute into a list of individual classes, so a regular expression that spans several classes won't match. class
is one of a set of such attributes; see Multi-valued attributes in the BeautifulSoup documentation.
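To make the splitting concrete, here is a minimal sketch that inspects the class attribute of the span from the question. It assumes the built-in "html.parser" backend, which the original code does not specify.

```python
from bs4 import BeautifulSoup

# Assumption: "html.parser" is used so no extra parser package is needed.
soup = BeautifulSoup("<span class='_ _3'>empty</span>", "html.parser")

# 'class' is a multi-valued attribute, so it comes back as a list of names
# rather than the single string "_ _3".
classes = soup.span["class"]
print(classes)
```

Each entry in that list is a separate class name, which is why a pattern containing `\s` has no single class to match against.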
You'll have to use a custom function to filter on multiple classes using regular expressions:
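def underscored_class_span(elem, numbered=re.compile(r'_\d').match):
    # Only consider <span> elements
    if elem.name != 'span':
        return False
    # 'class' is multi-valued, so this is a list of class names
    classes = elem.get('class', [])
    # Require a bare '_' class plus at least one class like '_3'
    return '_' in classes and any(numbered(c) for c in classes)

for span in soup.find_all(underscored_class_span):
    span.decompose()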
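For reference, the same approach can be put together as a self-contained script. This is only a sketch: the helper names `is_underscored_span` and `strip_underscored_spans` are made up for illustration, the "html.parser" backend is an assumption, and the pattern is tightened to `_\d+$` so that only classes consisting of an underscore followed by digits count as numbered.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical pattern: an underscore followed only by digits, e.g. "_3".
NUMBERED = re.compile(r'_\d+$')

def is_underscored_span(tag):
    """Return True for <span> tags with a '_' class and a '_<digits>' class."""
    if tag.name != 'span':
        return False
    classes = tag.get('class', [])
    return '_' in classes and any(NUMBERED.match(c) for c in classes)

def strip_underscored_spans(html):
    """Parse html, remove every matching span, and return the remaining markup."""
    soup = BeautifulSoup(html, 'html.parser')  # parser choice is an assumption
    for span in soup.find_all(is_underscored_span):
        span.decompose()
    return str(soup)

cleaned = strip_underscored_spans(
    "<div>Text text <span class='_ _3'>empty</span> text</div>"
)
kept = strip_underscored_spans("<div><span class='note'>keep</span></div>")
print(cleaned)
print(kept)
```

Passing a function to find_all() means the filter sees the whole tag, so it can check the tag name and each class separately instead of relying on one regular expression.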
Upvotes: 2