Reputation: 465
I am using BeautifulSoup4 on a MacOSX running Python 2.7.8. I am having difficulty extracting information from the following html code
<tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<td headers="yui-dt0-th-rank" class="rank yui-dt0-col-rank"></td>
</tr>
<tr id="yui-rec1" class="yui-dt-odd">...</tr>
<tr id="yui-rec2" class="yui-dt-even">...</tr>
</tbody>
I can't seem to grab the table or any of it's contents because BS and/or python doesn't seem to recognize values with hyphens. So the usual code, something like
Table = soup.find('tbody',{'class':'yui-dt-data'})
or
Row2 = Table.find('tr',{'id':'yui-rec2'})
just returns an empty object (not NONE, simply empty). I'm not new to BS4 or Python and I've extracted information from this site before, but the class names are different now than when I previously did it. Now everything has hyphens. Is there any way to get Python to recognize the hyphen or a workaround?
I need to have my code be general so that I can run it across numerous pages that all have the same class name. Unfortunately, the id
attribute in <tbody>
is unique to that particular table, so I can't use that to identify this table across webpages.
Any help would be appreciated. Thanks in advance.
Upvotes: 5
Views: 5050
Reputation: 84455
Just use select
. bs4 4.7.1
import requests
from bs4 import BeautifulSoup as bs
html = '''
<tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<td headers="yui-dt0-th-rank" class="rank yui-dt0-col-rank"></td>
</tr>
<tr id="yui-rec1" class="yui-dt-odd">...</tr>
<tr id="yui-rec2" class="yui-dt-even">...</tr>
</tbody>
'''
soup = bs(html, 'lxml')
soup.select('.yui-dt-data')
Upvotes: 1
Reputation: 61
For people trying to find a solution to find a tag with hyphen in its attributes, there is an answer in the document https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments
This segment of code will cause error
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
you should do this
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
Upvotes: 6
Reputation: 576
The following code:
from bs4 import BeautifulSoup
htmlstring = """ <tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<tr id="yui-rec1" class="yui-dt-odd">
<tr id="yui-rec2" class="yui-dt-even">"""
soup = BeautifulSoup(htmlstring)
Table = soup.find('tbody', attrs={'class': 'yui-dt-data'})
print("Table:\n")
print(Table)
tr = Table.find('tr', attrs={'class': 'yui-dt-odd'})
print("tr:\n")
print(tr)
outputs:
Table:
<tbody class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650" tabindex="0">
<tr class="yui-dt-first yui-dt-even" id="yui-rec0">
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2"></tr></tr></tr></tbody>
tr:
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2"></tr></tr>
Even though the html you supplied isn't by itself valid, it seems that BS is making a guess about how it should be, because soup.prettify()
yields
<tbody class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650" tabindex="0">
<tr class="yui-dt-first yui-dt-even" id="yui-rec0">
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2">
</tr>
</tr>
</tr>
</tbody>
Though I'm guessing those tr's aren't supposed to be nested.
Could you try running that exact code and seeing what the output is?
Upvotes: 6