Bug in python bs4 analyzer classes?

Question

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import re
doc = "Text text empty text"
soup = BeautifulSoup(doc)
for span in soup.find_all('span', class_=re.compile(r"_\s_[0-9]+")):
    span.decompose()

I need to find all the span tags whose class matches the pattern above and remove them from the DOM, but this piece of code is not working for some reason!

Martijn Pieters · Accepted Answer

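BeautifulSoup splits the class attribute into a list of individual class names for you, so the regular expression is tested against each class on its own and won't match a pattern that spans multiple classes. class is one of a set of such multi-valued attributes; see Multi-valued attributes in the documentation.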
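A minimal sketch of that splitting, for illustration only. The markup and the class names `_` and `_5` are assumptions about what the question's pattern is meant to target (the original document is not shown), and `html.parser` is picked just to avoid an extra dependency:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are assumed for illustration.
html = '<p>Text <span class="_ _5">empty</span> text</p>'
soup = BeautifulSoup(html, "html.parser")

span = soup.find("span")
classes = span["class"]

# The attribute comes back as separate names, not as the string "_ _5".
print(classes)
```

Since each name is handled separately, a pattern that expects whitespace between two class names has nothing to match against in any single entry.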

You'll have to use a custom function to filter on multiple classes using regular expressions:

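# Filter: a <span> carrying both a bare '_' class and a numbered class such as '_5'.
def underscored_class_span(elem, numbered=re.compile(r'_\d').match):
    if elem.name != 'span':
        return False
    # 'class' is multi-valued, so this is a list of class names (empty if absent).
    classes = elem.get('class', [])
    return '_' in classes and any(numbered(c) for c in classes)

# find_all() accepts a function and keeps the tags for which it returns True.
for span in soup.find_all(underscored_class_span):
    span.decompose()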
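To try the function-filter approach end to end, here is a self-contained sketch. The sample markup is hypothetical (the question's real document is not available), and `html.parser` is an arbitrary parser choice:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, assumed for illustration only.
html = (
    '<p>Text <span class="_ _5">drop me</span> text '
    '<span class="keep">keep me</span> '
    '<span class="_">also kept</span></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Matches a class name that starts with an underscore followed by a digit.
numbered = re.compile(r"_\d").match

def underscored_class_span(elem):
    """True for <span> tags with a bare '_' class and a numbered class."""
    if elem.name != "span":
        return False
    classes = elem.get("class", [])
    return "_" in classes and any(numbered(c) for c in classes)

# find_all() returns a list, so removing tags while looping over it is safe.
for span in soup.find_all(underscored_class_span):
    span.decompose()

remaining = [s.get_text() for s in soup.find_all("span")]
print(remaining)
```

Only the span that has both the `_` class and a numbered class is removed; the span with just `_` and the unrelated `keep` span stay in the tree.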
