Reputation: 167
I have some HTML:
<td class="course-section-type"><span class="text-capitalize">lecture (5)</span></td>
<td class="course-section-meeting">
<table class="no-borders" width="100%">
<tbody>
<tr>
<td width="23%">MWF</td>
<td width="55%">11:30 AM - 12:20 PM</td>
<td width="22%"><span><a href="http://myurl.com" target="_blank">MGH</a> <span class="sr-only">building room</span> 389</span></td>
</tr>
</tbody>
</table>
</td>
<td class="course-section-sln">00000</td>
I'd like to extract the values of the top-level "class" attributes and map each one to a list of the text nested beneath it. For the above HTML, that would look something like:
data = {
    "course-section-type": ["lecture (5)"],
    "course-section-meeting": ["MWF", "11:30 AM - 12:20 PM", "MGH", "building room", "389"],
    "course-section-sln": ["00000"]
}
I know that I can get the text of a cell by calling .text on each tag returned by soup.findAll('td'), but I don't know how to traverse the HTML tree or how to extract the value of a tag attribute. How would I go about doing this?
Any help is appreciated.
Upvotes: 1
Views: 795
Reputation: 167
Figured it out. It turns out that BeautifulSoup's findAll accepts the keyword argument text=True, which finds all the text nodes under a given tag (in document order) and returns them as a list.
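d = {}
# `line` is the parsed row (a BeautifulSoup tag) that contains the <td> cells
for tag in line.findAll('td'):
    classes = tag.get("class")
    if classes and "course" in classes[0]:
        # keep only the non-empty text nodes; whitespace between tags is dropped
        d[classes[0]] = [text.strip() for text in tag.findAll(text=True) if text.strip()]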
>>> d
{"course-section-type": ["lecture (5)"],
 "course-section-meeting": ["MWF", "11:30 AM - 12:20 PM", "MGH", "building room", "389"],
 "course-section-sln": ["00000"]}
Upvotes: 2
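For anyone who wants to try the findAll(text=True) approach end to end, here is a minimal self-contained sketch run against the HTML from the question. It assumes Beautiful Soup 4 with the built-in "html.parser" backend, and it uses find_all(string=True), the current spelling of findAll(text=True). The "course-" prefix check and the variable names are my own choices for illustration.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question
html = """
<td class="course-section-type"><span class="text-capitalize">lecture (5)</span></td>
<td class="course-section-meeting">
<table class="no-borders" width="100%">
<tbody>
<tr>
<td width="23%">MWF</td>
<td width="55%">11:30 AM - 12:20 PM</td>
<td width="22%"><span><a href="http://myurl.com" target="_blank">MGH</a> <span class="sr-only">building room</span> 389</span></td>
</tr>
</tbody>
</table>
</td>
<td class="course-section-sln">00000</td>
"""

soup = BeautifulSoup(html, "html.parser")

data = {}
for td in soup.find_all("td"):
    # "class" is a multi-valued attribute, so .get() returns a list (or None)
    classes = td.get("class")
    if classes and classes[0].startswith("course-"):
        # every text node under this cell, in document order
        texts = (s.strip() for s in td.find_all(string=True))
        # drop the whitespace-only nodes that sit between tags
        data[classes[0]] = [t for t in texts if t]

print(data)
```

The inner cells carry only a width attribute, so td.get("class") returns None for them and they are skipped; their text still ends up in the list for the enclosing course-section-meeting cell.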
Reputation: 90
One solution is to extract everything according to a fixed pattern. Because this is a table nested inside a table, the schema has to be fixed; otherwise, the next time the layout changes, everything breaks again:
course-section-type is the text of the first <td> of the outer table
course-section-meeting is all of the text in the inner table
course-section-sln is the text of the third <td> of the outer table
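A rough sketch of that fixed-pattern idea, assuming Beautiful Soup 4 with "html.parser". The function name extract_fixed, the wrapping <table><tr> around the question's fragment, and the decision to raise ValueError on an unexpected layout are all my own assumptions, not something stated above.

```python
from bs4 import BeautifulSoup

# The question's fragment, wrapped in an outer <table><tr> (assumed) so there is a row to select
html = """<table><tr>
<td class="course-section-type"><span class="text-capitalize">lecture (5)</span></td>
<td class="course-section-meeting">
<table class="no-borders" width="100%">
<tbody>
<tr>
<td width="23%">MWF</td>
<td width="55%">11:30 AM - 12:20 PM</td>
<td width="22%"><span><a href="http://myurl.com" target="_blank">MGH</a> <span class="sr-only">building room</span> 389</span></td>
</tr>
</tbody>
</table>
</td>
<td class="course-section-sln">00000</td>
</tr></table>"""


def extract_fixed(row):
    """Read the three outer cells of `row` by position (hypothetical helper)."""
    # recursive=False looks only at direct children, so inner-table cells are ignored
    cells = row.find_all("td", recursive=False)
    if len(cells) != 3:
        # fail loudly if the layout is not the one we expect
        raise ValueError(f"expected 3 outer cells, found {len(cells)}")
    type_cell, meeting_cell, sln_cell = cells
    return {
        "course-section-type": [type_cell.get_text(strip=True)],
        # all non-empty text inside the inner table, in document order
        "course-section-meeting": list(meeting_cell.stripped_strings),
        "course-section-sln": [sln_cell.get_text(strip=True)],
    }


# find() returns the first <tr> in document order, which is the outer row
row = BeautifulSoup(html, "html.parser").find("tr")
data = extract_fixed(row)
print(data)
```

The length check is what makes the pattern "fixed": if the page layout changes, the function raises instead of silently returning misaligned data.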
Upvotes: 0