Reputation: 2693
First, the code:
import requests
from bs4 import BeautifulSoup
url = 'https://stackoverflow.com/questions/tagged/python'
payload = {'pageSize': '5'}  # ask for 5 questions per page
r = requests.get(url, params=payload)
content = r.text
soup = BeautifulSoup(content, 'html.parser')
questions = soup.select('div#questions h3')  # one h3 per question title
print(r.url)
print(len(questions))
Output
https://stackoverflow.com/questions/tagged/python?pageSize=5
50
Expected Output
https://stackoverflow.com/questions/tagged/python?pageSize=5
5
In making the above request, stackoverflow.com appears to be semi-ignoring the pageSize parameter. I say semi-ignoring because r.text does contain '<meta property="og:url" content="https://stackoverflow.com/questions/tagged/python?pageSize=5"/>', which indicates that it is aware of the parameter. Yet it returns 50 questions. If you open https://stackoverflow.com/questions/tagged/python?pageSize=5 directly in a browser, only 5 questions are returned.
Is there a way to get stackoverflow.com to respect the pageSize parameter when the request is sent via the requests library?
Upvotes: 2
Views: 1945
Reputation: 2130
The problem is your User-Agent. By default, requests sends headers that look like this:
{'User-Agent': 'python-requests/2.19.1', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
Notice the User-Agent: it is 'python-requests', so Stack Overflow ignores the pageSize query parameter because it can tell the request is not coming from a real browser. To work around this, send a browser-style User-Agent header with your request:
r = requests.get(url, params=payload, headers={'User-Agent': 'Mozilla/5.0'})
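As a sketch of how the header override works (no network call needed, and assuming requests 2.x), you can build the request with requests.Request and inspect the prepared headers and URL before anything is sent. The 'Mozilla/5.0' value here is just an illustrative browser-style User-Agent, not a requirement:

```python
import requests

# Build the request without sending it, so we can inspect exactly
# what requests would transmit.
req = requests.Request(
    'GET',
    'https://stackoverflow.com/questions/tagged/python',
    params={'pageSize': '5'},
    headers={'User-Agent': 'Mozilla/5.0'},  # overrides the python-requests default
)
prepared = req.prepare()

print(prepared.url)                      # pageSize=5 is encoded into the URL
print(prepared.headers['User-Agent'])    # our override, not 'python-requests/...'
```

If the User-Agent header is left out, prepared.headers would fall back to the python-requests default shown above, which is what Stack Overflow keys off when it ignores pageSize.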
Upvotes: 1