bs4 findAll not finding class tags

Question

I'm trying to parse a table using bs4. When I call find_all with a specific class, nothing is returned. However, when I do not specify the class, it returns something. For example, this returns the table and all of the td elements:

from bs4 import BeautifulSoup as soup

page_soup = soup(html, 'html.parser')

stat_table = page_soup.find_all('table')
stat_table = stat_table[0]

with open('stats.txt', 'w', encoding='utf-8') as q:
    for row in stat_table.find_all('tr'):
        for cell in row.find_all('td'):
            q.write(cell.text.strip().ljust(18))

If I try to use this:

page_soup = soup(html, 'html.parser')

stat_table = page_soup.find_all('table')
stat_table = stat_table[0]

with open('stats.txt', 'w', encoding='utf-8') as q:
    for row in stat_table.find_all('tr'):
        for cell in row.find_all('td', {'class': 'blah'}):
            q.write(cell.text.strip().ljust(18))

This code should return the td elements with the specified class, but nothing is returned. Any help would be greatly appreciated.

abarnert · Accepted Answer

The class attribute isn't a normal string, but a multi-valued attribute.¹

For example:

>>> from bs4 import BeautifulSoup
>>> text = "<span class='a  b c '>spam</span>"
>>> soup = BeautifulSoup(text, 'html.parser')
>>> soup.span['class']
['a', 'b', 'c']

To search for a multi-valued attribute, you should pass multiple values:

>>> soup.find('span', class_=('a', 'b', 'c'))
<span class="a b c">spam</span>

Notice that, even though BeautifulSoup presents the values as a list, they act more like a set: you can pass the same values in any order, and duplicates are ignored. (As far as I can tell, a sequence or set matches any element that carries at least one of the listed classes.)

>>> soup.find('span', class_={'a', 'b', 'c'})
<span class="a b c">spam</span>
>>> soup.find('span', class_=('c', 'b', 'a', 'a'))
<span class="a b c">spam</span>

You can also search on a multi-valued attribute with a string, which will find any elements whose attribute includes that string as one of its values:

>>> soup.find('span', class_='c')
<span class="a b c">spam</span>
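A quick way to check how these filters behave on several elements at once is a small script. This is only a sketch with made-up markup (the three spans are not from the question). It suggests that a list of classes matches elements carrying at least one of them, so requiring all of them needs an explicit check:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not taken from the question.
html = """
<span class="a b c">all three</span>
<span class="a">only a</span>
<span class="d">unrelated</span>
"""
soup = BeautifulSoup(html, "html.parser")

# A list of classes: an element matches if it has at least one of them.
any_match = [tag.get_text() for tag in soup.find_all("span", class_=["a", "b", "c"])]

# To require every class, compare against the parsed class list yourself.
required = {"a", "b", "c"}
all_match = [
    tag.get_text()
    for tag in soup.find_all("span")
    if required <= set(tag.get("class", []))
]

print(any_match)  # ['all three', 'only a']
print(all_match)  # ['all three']
```

If "any of these classes" is what you want, the list form is enough; if you need "all of these classes", the explicit set comparison makes the intent unambiguous.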

But if you pass a string with whitespace… as far as I can tell, what it does isn't actually documented, but what happens in practice is that it will match exactly the (arbitrary) way the string is handed to BeautifulSoup by the parser.

As you can see above, even though the HTML had 'a  b c ' in it, BeautifulSoup has turned it into 'a b c', stripping whitespace off the ends and turning any internal runs of whitespace into single spaces. So, that's what you have to search for:

>>> soup.find('span', class_='a  b c ')
>>> soup.find('span', class_='a b c')
<span class="a b c">spam</span>
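The normalization described above can be mimicked in plain Python. This is only a model of the observed behavior, not Beautiful Soup's actual implementation, and normalize_class is a hypothetical helper name:

```python
def normalize_class(value: str) -> str:
    """Strip the ends and collapse internal whitespace runs to single spaces."""
    return " ".join(value.split())


raw = "a  b c "                     # the attribute value as written in the HTML
tokens = raw.split()                # the separate class values
normalized = normalize_class(raw)   # the string form that an exact match expects

print(tokens)      # ['a', 'b', 'c']
print(normalized)  # a b c
```

If you really must search by the whole string, running the value through something like this first avoids guessing at the spacing.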

But, again, you're better off searching with a sequence or set of separate values than trying to guess how to put them together into a string that happens to work.

So, what you want to do is probably:

for cell in row.find_all('td', {'class': ('column-even', 'table-column-even', 'ft_enrolled')}):

Or, maybe you don't want to think in DOM terms but in CSS-selector terms:

>>> soup.select('span.a.b.c')
[<span class="a b c">spam</span>]

Notice that CSS also doesn't care about the order of the classes, or about duplicates:

>>> soup.select('span.c.a.b.c')
[<span class="a b c">spam</span>]

Also, this lets you require a subset of the classes (here, both c and b), rather than just one of them:

>>> soup.select('span.c.b')
[<span class="a b c">spam</span>]
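Applied to a table like the one in the question, the selector approach might look like the sketch below. The HTML is invented for illustration; only the three class names come from the find_all example earlier in this answer.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the real page's markup is not shown in the question.
html = """
<table>
  <tr>
    <td class="column-even table-column-even ft_enrolled">42</td>
    <td class="column-odd">ignored</td>
  </tr>
  <tr>
    <td class="ft_enrolled table-column-even column-even">17</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# The selector requires all three classes, in any order.
cells = [
    cell.get_text(strip=True)
    for cell in table.select("td.column-even.table-column-even.ft_enrolled")
]

print(cells)  # ['42', '17']
```

Dropping one of the classes from the selector loosens the match, so you can start strict and relax it if nothing comes back.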

1. This is a change from Beautiful Soup 3. Which shouldn't even need to be mentioned, as BS3 has been dead for nearly a decade, and doesn't run on Python 3.x or, in some cases, even 2.7. But people keep copying and pasting old BS3 code into blog posts and Stack Overflow answers, so other people keep getting surprised that the code they found online doesn't actually work. If that's what happened here, you need to learn to spot BS3 code so you can ignore it and look elsewhere.
