Maxim Postnikov

Reputation: 109

Bug in Python bs4 analyzer classes?

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import re

doc = "<div>Text text <span class='_ _3'>empty</span> text</div>"
soup = BeautifulSoup(doc, "html.parser")
for span in soup.find_all('span', class_=re.compile(r"_\s_[0-9]+")):
    span.decompose()

I need to find every <span> tag whose class attribute looks like "_ _\d+" (for example <span class='_ _3'>) and remove it from the DOM. For some reason this code does not remove anything.

Upvotes: 2

Views: 139

Answers (1)

Martijn Pieters

Reputation: 1122232

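BeautifulSoup splits the value of the class attribute into a list of separate classes, so a regular expression that spans several classes (here, "_" followed by whitespace and "_3") will not match. class is one of a set of such multi-valued attributes; see "Multi-valued attributes" in the BeautifulSoup documentation.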
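To illustrate the point with the standard library only, here is a small sketch that applies the question's pattern to each class token on its own and then to the unsplit attribute value. It assumes the pattern is tested against individual tokens, as described above; the variable names are made up for this illustration.

```python
import re

# The pattern from the question: "_", whitespace, then "_<digits>".
pattern = re.compile(r"_\s_[0-9]+")

class_attr = "_ _3"          # the raw attribute value in the HTML
tokens = class_attr.split()  # how a multi-valued attribute is split up

# No single token contains whitespace, so none can satisfy the pattern.
per_token = [bool(pattern.search(token)) for token in tokens]

# The unsplit string does satisfy it.
whole = bool(pattern.search(class_attr))

print(tokens, per_token, whole)
```

Neither "_" nor "_3" contains the whitespace the pattern requires, which is why a check on each class separately finds nothing.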

You'll have to use a custom function to filter on multiple classes using regular expressions:

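def underscored_class_span(elem, numbered=re.compile(r'_\d').match):
    # Only <span> tags are candidates.
    if elem.name != 'span':
        return False
    # 'class' is multi-valued, so this is a list such as ['_', '_3'].
    classes = elem.get('class', [])
    # Require a bare '_' class plus at least one class starting with '_<digit>'.
    return '_' in classes and any(numbered(c) for c in classes)

for span in soup.find_all(underscored_class_span):
    span.decompose()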
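For completeness, a self-contained sketch of the same idea applied to the document from the question. The helper name strip_underscored_spans, the extra "keep" span, and the choice of "html.parser" are assumptions made for this example; the pattern is also tightened to require the whole token to be "_" plus digits, which is slightly stricter than the filter above.

```python
import re
from bs4 import BeautifulSoup

# Assumption: a "numbered" class is exactly "_" followed by digits, e.g. "_3".
NUMBERED = re.compile(r"_\d+")


def underscored_class_span(elem):
    """True for <span> tags with a bare "_" class and a "_<digits>" class."""
    if elem.name != "span":
        return False
    classes = elem.get("class") or []
    return "_" in classes and any(NUMBERED.fullmatch(c) for c in classes)


def strip_underscored_spans(html):
    """Hypothetical helper: parse html, drop matching spans, return markup."""
    soup = BeautifulSoup(html, "html.parser")
    for span in soup.find_all(underscored_class_span):
        span.decompose()
    return str(soup)


doc = (
    "<div>Text text <span class='_ _3'>empty</span> text"
    "<span class='_'>keep</span></div>"
)
result = strip_underscored_spans(doc)
print(result)
```

The span with classes "_" and "_3" is removed together with its contents, while the span that only has the "_" class stays in place.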

Upvotes: 2
