Reputation: 125
I am trying to scrape multiple pages using BeautifulSoup, but I am getting only the last page's results as output. Please suggest the right way to do this. Below is my code.
# Imports and setup assumed by the snippet (not shown in the original post)
from random import randint
from time import sleep, time

import pandas as pd
from bs4 import BeautifulSoup
from IPython.display import clear_output
from requests import get

names, orgs, Years = [], [], []
requests = 0
start_time = time()

# For every page
for page in range(0,8):
    # Make a get request
    response = get('http://nationalacademyhr.org/fellowsdirectory?page=0%2C{}' + format(page))
    # Pause the loop
    sleep(randint(8,15))
    # Monitor the requests
    requests += 1
    elapsed_time = time() - start_time
    print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait = True)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    all_table_info = html_soup.find('table', class_ = "views-table cols-4")
    for name in all_table_info.find_all('div',
                                        class_="views-field views-field-view"):
        names.append(name.text.replace("\n", " ") if name.text else None)
    for organization in all_table_info.find_all('td',
                                                class_="views-field views-field-field-employer"):
        orgs.append(organization.text.strip() if organization.text else None)
    for year in all_table_info.find_all('td',
                                        class_ = "views-field views-field-view-2"):
        Years.append(year.text.strip() if year.text else None)
df = pd.DataFrame({'Name' : names, 'Org' : orgs, 'year' : Years })
print (df)
Upvotes: 0
Views: 153
Reputation: 87134
Note: there are 9 pages on the site, identified by page=0,0 through page=0,8. Your loop should use range(9). Or, even better, load the first page, then get the URL for the next page from its next link. Iterate over all the pages by following the next link until there is no next link on the page.
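A minimal sketch of the "follow the next link" idea, in case it helps. Everything here is an assumption for illustration: the names NextLinkFinder, iter_pages and fake_site are hypothetical, the pager link is assumed to be an <a> whose text starts with "next" (the real site's markup may differ), and link discovery uses the standard library's HTMLParser only so the sketch runs without extra installs. The same lookup could be done with BeautifulSoup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Record the href of the first <a> whose visible text starts with 'next'.

    Assumption: the pager marks its link with the text 'next'.
    """

    def __init__(self):
        super().__init__()
        self.next_href = None
        self._current_href = None  # href of the <a> currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            label = "".join(self._text).strip().lower()
            if self.next_href is None and label.startswith("next"):
                self.next_href = self._current_href
            self._current_href = None


def iter_pages(start_url, fetch, max_pages=50):
    """Yield (url, html) for each page, following 'next' links until none is left.

    fetch is any callable mapping a URL to its HTML text.
    """
    url = start_url
    seen = set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        finder = NextLinkFinder()
        finder.feed(html)
        url = urljoin(url, finder.next_href) if finder.next_href else None


# In-memory stand-in for a paginated site, so the sketch runs offline.
fake_site = {
    "http://example.test/dir": '<a href="/dir?page=1">next ›</a>',
    "http://example.test/dir?page=1": '<a href="/dir">previous</a><a href="/dir?page=2">next ›</a>',
    "http://example.test/dir?page=2": '<a href="/dir?page=1">previous</a>',
}
visited = [url for url, _ in iter_pages("http://example.test/dir", fake_site.__getitem__)]
print(visited)
```

Against the real site, fetch could be something like lambda url: get(url).text, with the sleep between requests kept inside it. The seen set and max_pages cap are only there to stop the loop if a pager ever links back to an earlier page.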
Further to xhancar's answer, which identifies the problem, a better way is to avoid string operations when building URLs and instead let requests construct the URL query string for you:
for page in range(9):
    params = {'page': '0,{}'.format(page)}
    response = get('http://nationalacademyhr.org/fellowsdirectory', params=params)
The params argument is passed to requests.get(), which adds the values to the URL query string. The query parameters will be properly encoded, e.g. the , is replaced with %2C.
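To see what that encoding looks like without making any requests, here is a small sketch using the standard library's urlencode, which percent-encodes the comma in the same way. The helper name page_url and the BASE_URL constant are just for illustration; requests builds the query string itself when you pass params.

```python
from urllib.parse import urlencode

BASE_URL = 'http://nationalacademyhr.org/fellowsdirectory'


def page_url(page):
    """Build the URL for one directory page, letting urlencode escape the comma."""
    query = urlencode({'page': '0,{}'.format(page)})
    return '{}?{}'.format(BASE_URL, query)


# One URL per page, page=0,0 through page=0,8
urls = [page_url(page) for page in range(9)]
print(urls[0])
print(urls[-1])
```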
Upvotes: 0
Reputation: 797
There is a typo: a plus sign instead of a dot. You need 'http://nati...ge=0%2C{}'.format(page), but you wrote 'http://nati...ge=0%2C{}' + format(page). URLs that still contain the braces before the page number all end up at the same page.
EDIT:
In case that was not clear, you just need to change the line
    response = get('http://nationalacademyhr.org/fellowsdirectory?page=0%2C{}' + format(page))
to
    response = get('http://nationalacademyhr.org/fellowsdirectory?page=0%2C{}'.format(page))
In the first case the resulting URL also contains the literal substring '{}', which causes the problem.
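A quick way to see the difference without touching the network is to build both strings and compare them. The variable names below are only for illustration. The built-in format(page) simply returns the page number as a string, so the + version appends it after the untouched braces.

```python
template = 'http://nationalacademyhr.org/fellowsdirectory?page=0%2C{}'
page = 3

# Built-in format(page) returns '3'; concatenation leaves the '{}' in the URL.
wrong = template + format(page)

# str.format substitutes the page number for the '{}' placeholder.
right = template.format(page)

print(wrong)
print(right)
```

The first URL ends in 0%2C{}3 and the second in 0%2C3.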
Upvotes: 2