Reputation: 109
I'm trying to parse a table using bs4. When I call find_all with a specific class, nothing is returned. However, when I do not specify the class, it returns something. For example, this returns the table and all of the td elements:
from bs4 import BeautifulSoup as soup
page_soup = soup(html, 'html.parser')
stat_table = page_soup.find_all('table')
stat_table = stat_table[0]
with open('stats.txt', 'w', encoding='utf-8') as q:
    for row in stat_table.find_all('tr'):
        for cell in row.find_all('td'):
            q.write(cell.text.strip().ljust(18))
If I try to use this:
page_soup = soup(html, 'html.parser')
stat_table = page_soup.find_all('table')
stat_table = stat_table[0]
with open('stats.txt', 'w', encoding='utf-8') as q:
    for row in stat_table.find_all('tr'):
        for cell in row.find_all('td', {'class': 'blah'}):
            q.write(cell.text.strip().ljust(18))
This code should return the specific td elements with the specified class, but nothing is returned. Any help would be greatly appreciated.
Upvotes: 2
Views: 2387
Reputation: 365767
The class attribute isn't a normal string, but a multi-valued attribute.[1]
For example:
>>> from bs4 import BeautifulSoup
>>> text = "<div><span class='a b c '>spam</span></div>"
>>> soup = BeautifulSoup(text, 'html.parser')
>>> soup.span['class']
['a', 'b', 'c']
To search a multi-valued attribute, you can pass multiple values; an element matches if its attribute contains at least one of them:
>>> soup.find('span', class_=('a', 'b', 'c'))
<span class="a b c">spam</span>
Notice that, even though BeautifulSoup presents the values as a list, a search treats them more like a set. A match on any one value is enough, so you can pass the same values in any order, and duplicates make no difference:
>>> soup.find('span', class_={'a', 'b', 'c'})
<span class="a b c">spam</span>
>>> soup.find('span', class_=('c', 'b', 'a', 'a'))
<span class="a b c">spam</span>
You can also search on a multi-valued attribute with a string, which will find any elements whose attribute includes that string as one of its values:
>>> soup.find('span', class_='c')
<span class="a b c">spam</span>
But if you pass a string with whitespace, it has to match the attribute's whole string value exactly, in the form BeautifulSoup holds it after parsing. That means the class names must appear in the same order as in the document, separated by single spaces.
As you can see above, even though the HTML had 'a b c ' in it, BeautifulSoup has turned it into 'a b c', stripping whitespace off the ends and turning any internal runs of whitespace into single spaces. So, that's what you have to search for:
>>> soup.find('span', class_='a b c ')
>>> soup.find('span', class_='a b c')
<span class="a b c">spam</span>
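If you do need the exact-string form, one way to avoid guessing is to rebuild the string from the parsed list of classes. This is a small sketch under the assumption that joining the parsed values with single spaces reproduces the form the search compares against; the variable names normalized and found are just for illustration.

```python
from bs4 import BeautifulSoup

# Messy whitespace in the source attribute, similar to the example above.
soup = BeautifulSoup("<div><span class='  a   b c '>spam</span></div>", "html.parser")
span = soup.span

# The parser has already split the attribute into separate class names.
normalized = " ".join(span["class"])
print(repr(normalized))  # 'a b c'

# Searching with the rebuilt string finds the same element.
found = soup.find("span", class_=normalized)
print(found is span)  # True
```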
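But, again, you're better off searching with separate values than trying to guess how to put them together into a string that happens to work. Keep in mind that a sequence or set matches elements that have any one of the values; if you need all of them at once, a CSS selector is the more reliable tool.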
So, what you want to do is probably:
for cell in row.find_all('td', {'class': ('column-even', 'table-column-even', 'ft_enrolled')}):
Or, maybe you don't want to think in DOM terms but in CSS-selector terms:
>>> soup.select('span.a.b.c')
[<span class="a b c">spam</span>]
Notice that CSS also doesn't care about the order of the classes, or about duplicates:
>>> soup.select('span.c.a.b.c')
[<span class="a b c">spam</span>]
A selector requires every class it names, so this also lets you search for any subset of an element's classes, rather than just one of them or the exact full string:
>>> soup.select('span.c.b')
[<span class="a b c">spam</span>]
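Applied to the table loop from the question, a selector-based version might look like the sketch below. The HTML is a made-up stand-in for the real page, which isn't shown here, and it reuses two of the class names from the find_all line above; adjust the selector to whatever the actual markup contains.

```python
from bs4 import BeautifulSoup

# Hypothetical table standing in for the real page.
html = """
<table>
  <tr><td class="column-even ft_enrolled">120</td><td class="column-odd">x</td></tr>
  <tr><td class="ft_enrolled column-even">95</td><td class="column-even">y</td></tr>
</table>
"""
page_soup = BeautifulSoup(html, "html.parser")
stat_table = page_soup.find("table")

cells = []
for row in stat_table.find_all("tr"):
    # td.column-even.ft_enrolled requires both classes, in either order.
    for cell in row.select("td.column-even.ft_enrolled"):
        cells.append(cell.get_text(strip=True))

print(cells)  # ['120', '95']
```

The cells that carry only one of the two classes are skipped, which is the behaviour a single-class or any-of search can't give you.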
1. This is a change from Beautiful Soup 3. Which shouldn't even need to be mentioned, as BS3 has been dead for nearly a decade, and doesn't run on Python 3.x or, in some cases, even 2.7. But people keep copying and pasting old BS3 code into blog posts and Stack Overflow answers, so other people keep getting surprised that the code they found online doesn't actually work. If that's what happened here, you need to learn to spot BS3 code so you can ignore it and look elsewhere.
Upvotes: 1