Crumbo0

Reputation: 49

Python Data Scraping - Extracting lines in the table where the tag '<td>' exists

I have been working on web scraping and have gotten pretty far in preparing my table from the web page I am scraping from.

The problem is that I can't work out how to extract only the entries that contain the data (the lines that start with '<td>'). My code is as follows:

import requests
from bs4 import BeautifulSoup

url = requests.get('https://en.wikipedia.org/wiki/Demographics_of_Toronto_neighbourhoods')

soup = BeautifulSoup(url.text,'lxml')
print(soup.prettify())

table_classes = {'class':'sortable'}
raw_table = soup.findAll("table", table_classes)
print(raw_table)

Adding the next line of code causes the error "ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?":

td_tags = raw_table.find_all('<td>')
td_tags

Looking at the data type I then tried to use find() and it still caused the same error, so I then tried looping over each line with the following code:

for line in raw_table:
    if line.get_text().find('<td>') > -1:
        line

When I run this loop, nothing happens. If I move the line outside of the 'if' block, it just returns every line in the table 'Canada_table_raw'.

How can I get the entries with the '<td>' tag so that I can then put the results into a pandas data frame?

Upvotes: 2

Views: 152

Answers (2)

QHarr

Reputation: 84465

Why not use select and grab all the td elements?

data = [item.text for item in soup.select('.sortable td')]
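To see what that selector picks up without hitting the network, here is a minimal sketch run against a made-up HTML snippet. The table contents are placeholders, not data from the Wikipedia page, and it uses the built-in 'html.parser' rather than 'lxml' only so that nothing extra has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page; the values are placeholders.
html = """
<table class="wikitable sortable">
  <tr><th>Name</th><th>Population</th></tr>
  <tr><td>Area A</td><td>1,000</td></tr>
  <tr><td>Area B</td><td>2,000</td></tr>
</table>
<table class="other"><tr><td>ignored</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# '.sortable td' matches every <td> nested inside an element
# whose class list includes "sortable"; the second table is skipped.
data = [item.get_text(strip=True) for item in soup.select(".sortable td")]
print(data)
```

Note that this gives one flat list of cell texts, so the row boundaries are lost. If you need rows, select the tr elements first and collect the td elements inside each one.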

Upvotes: 0

Edeki Okoh

Reputation: 1844

You are missing one piece of code to get the parser to run.

import requests
from bs4 import BeautifulSoup

url = requests.get(
    'https://en.wikipedia.org/wiki/Demographics_of_Toronto_neighbourhoods')

soup = BeautifulSoup(url.text, 'lxml')

table_classes = {'class': 'sortable'}
raw_table = soup.findAll("table", table_classes)
#print(raw_table)
# raw_table is a ResultSet (a list of tables), so search each table in turn
for table in raw_table:
    print(table.findAll('td'))

As the error message says, you are getting back a ResultSet object, which behaves like a list of tables, so you need to iterate over it to get the specific elements that you need. In this case we are returning all of the td elements in each table of the ResultSet, with the following output:

[<td><b>Toronto <a class="mw-redirect" href="/wiki/Census_metropolitan_area" title="Census metropolitan area">CMA</a> Average</b>
</td>, <td>
</td>, <td>All
</td>, <td><b>5,113,149</b>
</td>, <td><b>5903.63</b>
</td>, <td><b>866</b>
</td>, <td><b>9.0</b>
</td>, <td><b>40,704</b>
</td>, <td><b>10.6</b>
</td>, <td><b>11.4</b>
</td>, <td>
</td>, <td>
</td>, <td>
</td>, <td><a href="/wiki
........

Now you just need to decide which elements you are looking for and adjust the loop over the td elements to get the results you want.
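Since the goal is a pandas data frame, one possible way to go from the td elements to rows is sketched below. It is only an illustration under some assumptions: the HTML is a made-up snippet with placeholder values rather than the Wikipedia table, the header cells are assumed to be th elements, and 'html.parser' is used so the sketch runs offline without lxml.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page; the values are placeholders.
html = """
<table class="wikitable sortable">
  <tr><th>Name</th><th>Population</th></tr>
  <tr><td>Area A</td><td>1,000</td></tr>
  <tr><td>Area B</td><td>2,000</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag (or None), unlike findAll()/find_all(),
# which return a ResultSet that has to be looped over.
table = soup.find("table", {"class": "sortable"})

# Column names come from the <th> cells.
header = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has no <td> cells, so it is skipped
        rows.append(cells)

df = pd.DataFrame(rows, columns=header)
print(df)
```

Every value ends up as a string here, so numeric columns would still need cleaning (for example removing the thousands separators) before converting them to numbers.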

Upvotes: 1
