Marie Ducourau

Reputation: 93

How to avoid errors when a URL has just one page - Python - Web scraping

Hello everyone, I am a beginner and I am trying to use an IF/ELSE statement with URL links in web scraping. I want to select all the pages for departments 64 to 66. My URL is http://www.pour-les-personnes-agees.gouv.fr/annuaire-accueil-de-jour/{}/0 (with {} = 64, 65 or 66). My loop works and selects all the pages for 64. But department 65 has only one page, so the line last_page = soup.find('ul', class_='pagination').find('li', class_='next').a['href'].split('=')[1] fails there. Here is my code:

import requests
from bs4 import BeautifulSoup

url_list = ['http://www.pour-les-personnes-agees.gouv.fr/annuaire-accueil-de-jour/{}/0']
for link in url_list:
    r = requests.get(link)
    soup = BeautifulSoup(r.content, "html.parser")
    page_Url_test = [link.format(i) for i in range(64, 66)]
    for depart_page in page_Url_test:
        depart_page1 = str(depart_page) + "?page={}"
        r = requests.get(depart_page1)
        soup = BeautifulSoup(r.content, "html.parser")
        # this line raises AttributeError when the page has no pagination block (e.g. department 65)
        last_page = soup.find('ul', class_='pagination').find('li', class_='next').a['href'].split('=')[1]
        dept_page_Url = [depart_page1.format(i) for i in range(0, int(last_page)+1)]
print(dept_page_Url)

I tried to incorporate an IF ELSE like this:

for depart_page in page_Url_test:
    depart_page1 = str(depart_page) + "?page={}"
    r = requests.get(depart_page1)
    soup = BeautifulSoup(r.content, "html.parser")
    if len(depart_page1) == 0:  # depart_page1 is a non-empty string, so this is never true
        dept_page_Url = depart_page1
    else:
        last_page = soup.find('ul', class_='pagination').find('li', class_='next').a['href'].split('=')[1]
        dept_page_Url = [depart_page1.format(i) for i in range(0, int(last_page)+1)]
print(dept_page_Url)

But it doesn't work. How can I tell my code: if there is just one page, select just that one, otherwise do the next step? Any clue? I don't have enough knowledge to figure this out alone... Thank you a lot.

Upvotes: 0

Views: 56

Answers (1)

SIM

Reputation: 22440

As t.m.adam has already pointed out, you can try the approach below. I have also trimmed your code to make it more concise.

import requests
from bs4 import BeautifulSoup

url_list = 'http://www.pour-les-personnes-agees.gouv.fr/annuaire-accueil-de-jour/{}/0'
for link in [url_list.format(page) for page in range(64, 67)]:
    res = requests.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    depart_page = str(link) + "?page={}"
    if soup.find('ul', class_='pagination'):  # only departments with several pages have a pagination block
        last_page = soup.find('ul', class_='pagination').find('li', class_='next').a['href'].split('=')[1]
        dept_page_Url = [depart_page.format(i) for i in range(0, int(last_page) + 1)]
        print(dept_page_Url)

An alternative approach, if you also want to handle the single-page case explicitly:

if soup.find('ul', class_='pagination'):
    last_page = soup.find('ul', class_='pagination').find('li', class_='next').a['href'].split('=')[1]
    dept_page_Url = [depart_page.format(i) for i in range(0, int(last_page)+1)]
    print(dept_page_Url)
else:   
    print(link)
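
If you prefer to look the pagination up only once and also guard against a missing "next" link, you could move the check into a small helper function. This is only a sketch under the same assumptions about the site's markup (the pagination and next classes); the get_page_urls name and the base variable are mine, not part of the original code:

import requests
from bs4 import BeautifulSoup

def get_page_urls(link):
    # Return every paginated URL for one department,
    # falling back to the bare link when there is no pagination.
    soup = BeautifulSoup(requests.get(link).text, "lxml")
    pagination = soup.find('ul', class_='pagination')
    next_li = pagination.find('li', class_='next') if pagination else None
    if not next_li:
        return [link]  # single page: keep just the first (and only) one
    last_page = int(next_li.a['href'].split('=')[1])
    return [link + "?page={}".format(i) for i in range(last_page + 1)]

base = 'http://www.pour-les-personnes-agees.gouv.fr/annuaire-accueil-de-jour/{}/0'
for dept in range(64, 67):
    print(get_page_urls(base.format(dept)))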

Result of the first snippet (department 65 has only one page, so the if branch skips it and nothing is printed for it):

['http://www.pour-les-personnes-agees.gouv.fr/annuaire-accueil-de-jour/64/0?page=0', 'http://www.pour-les-personnes-agees.gouv.fr/annuaire-accueil-de-jour/64/0?page=1', 'http://www.pour-les-personnes-agees.gouv.fr/annuaire-accueil-de-jour/64/0?page=2']
['http://www.pour-les-personnes-agees.gouv.fr/annuaire-accueil-de-jour/66/0?page=0', 'http://www.pour-les-personnes-agees.gouv.fr/annuaire-accueil-de-jour/66/0?page=1', 'http://www.pour-les-personnes-agees.gouv.fr/annuaire-accueil-de-jour/66/0?page=2']
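
Another common pattern (not part of the original answer) is to catch the AttributeError instead of checking for the pagination element first; a minimal sketch under the same markup assumptions:

import requests
from bs4 import BeautifulSoup

url_list = 'http://www.pour-les-personnes-agees.gouv.fr/annuaire-accueil-de-jour/{}/0'
for link in [url_list.format(page) for page in range(64, 67)]:
    soup = BeautifulSoup(requests.get(link).text, "lxml")
    try:
        # raises AttributeError when there is no pagination / no "next" link
        last_page = soup.find('ul', class_='pagination').find('li', class_='next').a['href'].split('=')[1]
        print([link + "?page={}".format(i) for i in range(int(last_page) + 1)])
    except AttributeError:
        print(link)  # single-page department

The try/except version avoids the extra find call, but the explicit if check above is arguably easier to read.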

Upvotes: 1
