Reputation: 12747
I'm trying to write my first parser with BeautifulSoup (BS4) and hitting a conceptual issue, I think. I haven't done much with Python -- I'm much better at PHP.
I can get BeautifulSoup to find the table I want, but when I try to step into the table and find all the rows, I get some variation on:
AttributeError: 'ResultSet' object has no attribute 'attr'
I tried walking through the sample code at "How do I draw out specific data from an opened url in Python using urllib2?" and got more or less the same error (note: if you want to try it you'll need a working URL).
Some of what I'm reading says that the issue is that the ResultSet is a list. How would I know that? If I do print type(table) it just tells me <class 'bs4.element.ResultSet'>
I can find text in the table with:
for row in table:
    text = ''.join(row.findAll(text=True))
    print text
but if I try to search for HTML with:
for row in table:
    text = ''.join(row.find_all('tr'))
    print text
It complains about expected string, Tag found
So how do I wrangle this string (which is a string full of HTML) back into a beautifulsoup object that I can parse?
Upvotes: 1
Views: 5735
Reputation: 7233
BeautifulSoup data-types are bizarre to say the least. A lot of times they don't give enough information to easily piece together the puzzle. I know your pain! Anyway...on to my answer...
It's hard to provide a completely accurate example without seeing more of your code, or knowing the actual site you're attempting to scrape, but I'll do my best.
The problem is your ''.join(). .findAll('tr') returns a ResultSet, which is a list of BeautifulSoup Tag objects, one for each tr it finds, while ''.join() only accepts strings. Because of this, you're passing the wrong datatype to your ''.join().
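To make the type issue concrete, here is a small check. It is only a sketch against made-up markup (the HTML and the variable names are assumptions, not your page), and it sticks to print() with a single argument so it should behave the same under Python 2 and 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<table id="t">
  <tr><td>a</td><td>b</td></tr>
  <tr><td>c</td><td>d</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")         # a ResultSet
print(isinstance(tables, list))         # ResultSet is a subclass of list

rows = tables[0].find_all("tr")         # index into the list first, then search
print(type(rows[0]).__name__)           # each element is a Tag, not a string

try:
    "".join(rows)                       # join() only accepts strings
    join_failed = False
except TypeError:
    join_failed = True
print(join_failed)

# Converting each Tag to its HTML string first makes join() work.
joined = "".join(str(r) for r in rows)
print(joined)
```

The isinstance() check is one way to answer "how would I know it's a list?", since type() only reports the subclass name.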
You should code one more iteration. (I'm assuming there are td tags within the trs.)
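text_list = []
for row in table:                  # table is a ResultSet of table Tags
    table_row = row('tr')          # calling a Tag is shorthand for find_all()
    for table_data in table_row:
        td = table_data('td')
        for td_contents in td:
            # first child of the cell; assumes the cell is not empty
            content = td_contents.contents[0]
            text_list.append(content)
text = ' '.join(str(x) for x in text_list)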
This collects the entire table's content into a single string. You can refine the value of text by changing where text_list is initialized and where text = is assigned, for example by moving both inside the tr loop to get one string per row.
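As one illustration of moving text_list, the variant below builds a separate string for each row. It is a sketch on made-up markup (the HTML and the names row_texts and cells are assumptions), and it swaps contents[0] for get_text() so that an empty cell does not raise an IndexError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second row has an empty cell on purpose.
html = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")          # ResultSet, as in the question

row_texts = []
for tbl in table:
    for tr in tbl.find_all("tr"):
        # The cell list is rebuilt for every row, so each row gets its own string.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        row_texts.append(" ".join(cells))

print(row_texts)
```

Whether you want one string per table, per row, or per cell only changes where the list is created and where the join happens.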
This probably looks like more code than is required, and that might be true, but I've found my scrapes to be much more thorough and accurate when I go about it this way.
Upvotes: 3