Reputation: 1044
I have two pages that I would like to scrape: url_1 and url_2.
The only difference between them is that url_1 is the first page while url_2 is the third page of the same domain.
I am using urllib to read the URLs:
from urllib.request import urlopen
html_1 = urlopen(url_1).read()
html_2 = urlopen(url_2).read()
Unfortunately, html_2 has the same content as html_1.
Reading around, I found that this may be happening because the server sees me as a bot. For that reason, I switched to the requests module, with Beautiful Soup to parse the pages:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)AppleWebKit 537.36 (KHTML, like Gecko) Chrome", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"}
req_1 = session.get(url_1, headers=headers)
bsObj_1 = BeautifulSoup(req_1.text)
req_2 = session.get(url_2, headers=headers)
bsObj_2 = BeautifulSoup(req_2.text)
Still the content is the same. How can I fix it?
Upvotes: 1
Views: 66
Reputation: 8077
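Try this. It opens the site's home page first within the same session, so that any cookies the server sets are sent along with the later requests, and then fetches the two pages: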
import requests
from bs4 import BeautifulSoup
import time
url_1 = 'https://www.zoekscholen.onderwijsinspectie.nl/zoek-en-vergelijk?searchtype=generic&zoekterm=&pagina=&filterSectoren=BVE'
url_2 = 'https://www.zoekscholen.onderwijsinspectie.nl/zoek-en-vergelijk?searchtype=generic&zoekterm=&pagina=3&filterSectoren=BVE'
headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)AppleWebKit 537.36 (KHTML, like Gecko) Chrome",
"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"}
with requests.Session() as s:
    s.headers.update(headers)
    # Visit the home page first so the session picks up any cookies.
    s.get('https://www.zoekscholen.onderwijsinspectie.nl/')
    req_1 = s.get(url_1)
    soup1 = BeautifulSoup(req_1.text, "lxml")
    # The first <h2> inside #mainResults shows the result range.
    print(soup1.find("div", {"id": "mainResults"}).find_all("h2")[0].text)
    time.sleep(1)  # short pause between requests
    req_2 = s.get(url_2)
    soup2 = BeautifulSoup(req_2.text, "lxml")
    print(soup2.find("div", {"id": "mainResults"}).find_all("h2")[0].text)
Outputs:
Resultaten 1 - 20 van 165
Resultaten 41 - 60 van 165
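If you want more than these two pages, something like the following might help. It is only a sketch: the helper names (page_url, parse_result_header, page_count) are made up for illustration, and it assumes that the pagina parameter selects the page and that the header always has the "Resultaten X - Y van Z" form shown in the output above.

```python
import math
import re
from urllib.parse import urlencode

# Base URL taken from url_1 / url_2 above.
BASE = "https://www.zoekscholen.onderwijsinspectie.nl/zoek-en-vergelijk"


def page_url(page):
    """Build the search URL for a given page number (hypothetical helper)."""
    params = {
        "searchtype": "generic",
        "zoekterm": "",
        "pagina": page,
        "filterSectoren": "BVE",
    }
    return f"{BASE}?{urlencode(params)}"


def parse_result_header(text):
    """Extract (first, last, total) from a 'Resultaten 1 - 20 van 165' header."""
    match = re.search(r"Resultaten\s+(\d+)\s*-\s*(\d+)\s+van\s+(\d+)", text)
    if match is None:
        raise ValueError(f"unexpected header: {text!r}")
    first, last, total = map(int, match.groups())
    return first, last, total


def page_count(first_page_header):
    """Estimate the number of pages from the header of the FIRST page.

    Assumes every page except possibly the last shows the same number
    of results as the first page.
    """
    first, last, total = parse_result_header(first_page_header)
    per_page = last - first + 1
    return math.ceil(total / per_page)
```

Inside the with block you could then call page_count on the header text of the first page and pass page_url(n) to s.get for each n, keeping the time.sleep between requests.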
Upvotes: 1