K. Sanjay

Reputation: 125

Why does my program only output the last page of a multiple page scraping operation?

I am trying to scrape multiple pages with BeautifulSoup, but the output contains only the results from the last page. What is the right way to do this? My code is below.

# For every page
for page in range(0,8):
    # Make a get request
    response = get('http://nationalacademyhr.org/fellowsdirectory?page=0%2C{}' + format(page))
    # Pause the loop
    sleep(randint(8,15))
    # Monitor the requests
    requests += 1
    elapsed_time = time() - start_time
    print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait = True)

    html_soup = BeautifulSoup(response.text, 'html.parser')
    all_table_info = html_soup.find('table', class_ = "views-table cols-4")

    for name in all_table_info.find_all('div',
            class_="views-field views-field-view"):
        names.append(name.text.replace("\n", " ") if name.text else None)

    for organization in all_table_info.find_all('td',
            class_="views-field views-field-field-employer"):
        orgs.append(organization.text.strip() if organization.text else None)

    for year in all_table_info.find_all('td',
            class_ = "views-field views-field-view-2"):
        Years.append(year.text.strip() if year.text else None)

    df = pd.DataFrame({'Name' : names, 'Org' : orgs, 'year' : Years })

    print (df)

Upvotes: 0

Views: 153

Answers (2)

mhawke

Reputation: 87134

Note: the site has 9 pages, identified by page=0,0 through page=0,8, so your loop should use range(9). Better still, load the first page and take the URL of the next page from its "next" link, then keep following that link until a page has no next link.
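The "follow the next link" approach could be sketched roughly as below. This is only an illustration under assumptions: the NextLinkParser class, the crawl function, and the fetch argument are hypothetical names, and the sketch assumes the next link is an <a> inside an <li> whose class includes pager-next, which may not match the site's actual markup. It uses only the standard library so that the crawling loop can be tried without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Records the href of the first <a> found inside <li class="pager-next">.

    The 'pager-next' class name is an assumption about the page markup.
    """

    def __init__(self):
        super().__init__()
        self._in_next = False
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'li' and 'pager-next' in (attrs.get('class') or '').split():
            self._in_next = True
        elif tag == 'a' and self._in_next and self.next_href is None:
            self.next_href = attrs.get('href')

    def handle_endtag(self, tag):
        if tag == 'li':
            self._in_next = False


def crawl(start_url, fetch):
    """Yield (url, html) for each page until a page has no next link.

    fetch is any callable that takes a URL and returns the page HTML.
    """
    url = start_url
    seen = set()  # guards against a pager that loops back on itself
    while url and url not in seen:
        seen.add(url)
        html = fetch(url)
        yield url, html
        parser = NextLinkParser()
        parser.feed(html)
        # Resolve relative hrefs against the page they were found on.
        url = urljoin(url, parser.next_href) if parser.next_href else None
```

With requests, fetch could be something like lambda url: get(url).text, and the table parsing from the question would run on each html value that crawl yields.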


Further to xhancar's answer, which identifies the problem: a better way is to avoid string operations when building URLs and instead let requests construct the URL query string for you:

for page in range(9):
    params = {'page': '0,{}'.format(page)}
    response = get('http://nationalacademyhr.org/fellowsdirectory', params=params)

The params argument is passed to requests.get(), which adds its values to the URL query string. The query parameters are properly encoded for you, e.g. the comma is replaced with %2C.
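To see what the encoded query string looks like without making any requests, here is a small sketch that uses the standard library's urlencode as a stand-in for the encoding requests performs. The build_page_url helper is a hypothetical name introduced only for this illustration.

```python
from urllib.parse import urlencode


def build_page_url(base_url, page):
    """Return base_url with an encoded ?page=0,<page> query string."""
    query = urlencode({'page': '0,{}'.format(page)})
    return '{}?{}'.format(base_url, query)


base = 'http://nationalacademyhr.org/fellowsdirectory'
urls = [build_page_url(base, page) for page in range(9)]

print(urls[0])   # http://nationalacademyhr.org/fellowsdirectory?page=0%2C0
print(urls[-1])  # http://nationalacademyhr.org/fellowsdirectory?page=0%2C8
```

Each of the nine URLs differs only in the digit after %2C, which is the form the question's format string was meant to produce.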

Upvotes: 0

hancar

Reputation: 797

There is a typo: a plus where there should be a dot. You need 'http://nati...ge=0%2C{}'.format(page), but you wrote 'http://nati...ge=0%2C{}' + format(page).

URLs that contain the braces before the page number all end up at the same page.

EDIT:

If I was not clear, you just need to change the line response = get('http://nationalacademyhr.org/fellowsdirectory?page=0%2C{}' + format(page)) to response = get('http://nationalacademyhr.org/fellowsdirectory?page=0%2C{}'.format(page))

In the first case the built-in format(page) simply returns the page number as a string and appends it to the URL, so the resulting URL still contains the literal substring '{}', which causes the problem.
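The difference between the two expressions can be checked directly, without making any requests. The variable names below are chosen only for this illustration.

```python
base = 'http://nationalacademyhr.org/fellowsdirectory?page=0%2C{}'
page = 3

# Built-in format(page) returns '3', which is then appended after the braces.
concatenated = base + format(page)

# str.format substitutes the page number in place of the braces.
formatted = base.format(page)

print(concatenated)  # http://nationalacademyhr.org/fellowsdirectory?page=0%2C{}3
print(formatted)     # http://nationalacademyhr.org/fellowsdirectory?page=0%2C3
```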

Upvotes: 2
