Reputation: 352
I am having a problem with web-scraping. I am trying to learn how to do it, but I can't seem to get past some of the basics. I am getting an error, "TypeError: 'ResultSet' object is not callable" is the error I'm getting.
I've tried a number of different things. I was originally trying to use the "find" instead of "find_all" function, but I was having an issue with beautifulsoup pulling in a nonetype. I was unable to create an if loop that could overcome that exception, so I tried using the "find_all" instead.
page = requests.get('https://topworkplaces.com/publication/ocregister/')
soup = BeautifulSoup(page.text,'html.parser')all_company_list =
soup.find_all(class_='sortable-table')
#all_company_list = soup.find(class_='sortable-table')
company_name_list_items = all_company_list('td')
for company_name in company_name_list_items:
#print(company_name.prettify())
companies = company_name.content[0]
I'd like this to pull in all the companies in Orange County California that are on this list in a clean manner. As you can see, I've already accomplished pulling them in, but I want the list to be clean.
Upvotes: 1
Views: 1118
Reputation: 8047
You've got the right idea. I think instead of immediately finding all the <td>
tags (which is going to return one <td>
for each row (140 rows) and each column in the row (4 columns)), if you want only the company names, it might be easier to find all the rows (<tr>
tags) then append however many columns you want by iterating the <td>
s in each row.
This will get the first column, the company names:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://topworkplaces.com/publication/ocregister/')
soup = BeautifulSoup(page.text,'html.parser')
all_company_list = soup.find_all('tr')
company_list = [c.find('td').text for c in all_company_list[1::]]
Now company_list
contains all 140 company names:
>>> print(len(company_list))
['Advanced Behavioral Health', 'Advanced Management Company & R³ Construction Services, Inc.',
...
, 'Wes-Tec, Inc', 'Western Resources Title Company', 'Wunderman', 'Ytel, Inc.', 'Zillow Group']
Change c.find('td')
to c.find_all('td')
and iterate that list to get all the columns for each company.
Upvotes: 1
Reputation: 84465
Pandas:
Pandas is often useful here. The page uses multiple sorts including company size, rank. I show rank sort.
import pandas as pd
table = pd.read_html('https://topworkplaces.com/publication/ocregister/')[0]
table.columns = table.iloc[0]
table = table[1:]
table.Rank = pd.to_numeric(table.Rank)
rank_sort_table = table.sort_values(by='Rank', axis=0, ascending = True)
rank_sort_table.reset_index(inplace=True, drop=True)
rank_sort_table.columns.names = ['Index']
print(rank_sort_table)
Depending on your sort, companies in order:
print(rank_sort_table.Company)
Requests:
Incidentally, you can use nth-of-type to select just first column (company names) and use id, rather than class name, to identify the table as faster
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://topworkplaces.com/publication/ocregister/')
soup = bs(r.content, 'lxml')
names = [item.text for item in soup.select('#twpRegionalList td:nth-of-type(1)')]
print(names)
Note the default sorting is alphabetical on name column rather than rank.
Reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
Upvotes: 1