Reputation: 825
I don't understand why I am getting a 403 error for some of these sites.
If I visit the URLs manually the pages load fine. There isn't any error message other than the 403 response, so I don't know how to diagnose the problem.
from bs4 import BeautifulSoup
import requests
test_sites = [
    'http://fashiontoast.com/',
    'http://becauseimaddicted.net/',
    'http://www.lefashion.com/',
    'http://www.seaofshoes.com/',
]

for site in test_sites:
    print(site)
    # get page source
    response = requests.get(site)
    print(response)
    #print(response.text)
The result of running the above code is...
http://fashiontoast.com/
<Response [403]>
http://becauseimaddicted.net/
<Response [403]>
http://www.lefashion.com/
<Response [200]>
http://www.seaofshoes.com/
<Response [200]>
Can anyone help me understand the cause of the problem and the solution please?
Upvotes: 3
Views: 2446
Reputation: 28630
Sometimes a page rejects GET requests that do not identify a User-Agent. Visit the page with a browser (e.g. Chrome), right-click and choose 'Inspect', then copy the User-Agent header of the GET request from the Network tab.
from bs4 import BeautifulSoup
import requests
with requests.Session() as se:
    se.headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en"
    }

    test_sites = [
        'http://fashiontoast.com/',
        'http://becauseimaddicted.net/',
        'http://www.lefashion.com/',
        'http://www.seaofshoes.com/',
    ]

    for site in test_sites:
        print(site)
        # get page source
        response = se.get(site)
        print(response)
        #print(response.text)
Output:
http://fashiontoast.com/
<Response [200]>
http://becauseimaddicted.net/
<Response [200]>
http://www.lefashion.com/
<Response [200]>
http://www.seaofshoes.com/
<Response [200]>
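If you only need browser-like headers for a one-off request, you can also pass them directly to requests.get with the headers keyword instead of configuring a Session; a minimal sketch, reusing the same User-Agent string as above:

import requests

# Reuse the browser User-Agent string from above; any current browser UA should work
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}

# The headers apply to this single request only; no shared session state
response = requests.get('http://fashiontoast.com/', headers=headers)
print(response)  # should print <Response [200]>, matching the Session approach

A Session is still preferable when fetching many pages, since it reuses the underlying TCP connection and applies the headers to every request automatically.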
Upvotes: 4