Reputation: 76
I am trying to scrape data from the website https://angel.co/bloomfire
import requests
from bs4 import BeautifulSoup
res = requests.get('https://angel.co/pen-io')
soup = BeautifulSoup(res.content, 'html.parser')
print(soup.prettify())
This prints a page whose title tag is "Page not found - 404 - AngelList". In a web browser the site works fine, but its source is not the same as the output of my Python script. I have also tried Selenium with PhantomJS, but it shows the same thing.
Upvotes: 1
Views: 1077
Reputation: 79893
It looks like angel.co will respond with an HTTP 404 based on the User-Agent that is sent, and it looks like it will block the default requests agent (possibly depending on version). This is likely to discourage bot activity.
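To see which agent string requests sends by default, a request can be prepared and its headers inspected without contacting the server. This is only a minimal sketch: the URL is the one used in the session that follows, and the version suffix in the default agent depends on the installed requests release.

```python
import requests

# Build the request the way a Session would, but do not send it.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request("GET", "https://angel.co/bloom")
)

# The default agent has the form "python-requests/<version>".
default_agent = prepared.headers["User-Agent"]
print(default_agent)

# A User-Agent passed with the request replaces the default one.
custom = session.prepare_request(
    requests.Request(
        "GET",
        "https://angel.co/bloom",
        headers={"User-Agent": "Mozilla/5.0"},
    )
)
print(custom.headers["User-Agent"])
```

Nothing is sent over the network here; `prepare_request` only merges the session defaults with the per-request headers.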
Some output from my ipython session follows. I'm using requests/2.17.3.
In [37]: rsp = requests.get('https://angel.co/bloom')
In [38]: rsp.status_code
Out[38]: 404
In [39]: rsp = requests.get('https://angel.co/bloom', headers={'User-Agent': 'Mozilla/5.0'})
In [40]: rsp.status_code
Out[40]: 200
rsp.content contains the content you'd expect to see from angel.co/bloom.
In [41]: rsp = requests.get('https://angel.co/bloom', headers={'User-Agent': 'birryree angel scraper'})
In [42]: rsp.status_code
Out[42]: 200
So you should set the User-Agent header to get past whatever filtering or blocking angel.co applies to the various default agents.
If you're going to be doing heavy scraping, I'd suggest you be a good citizen and set an agent string that would let them contact you in case your scraping is causing issues, like:
requests.get('https://angel.co/bloom',
             headers={'User-Agent': 'Mozilla/5.0 (compatible; http://yoursite.com)'})
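If many pages are going to be fetched, setting the header once on a Session avoids repeating it on every call. The following is a rough sketch under a few assumptions: the contact URL is a placeholder, the helper names `make_session` and `fetch` are hypothetical (they are not part of requests), and the one-second pause is just an illustrative choice.

```python
import time

import requests

# Placeholder contact address; replace it with a page or email you control.
CONTACT_AGENT = "Mozilla/5.0 (compatible; http://yoursite.com)"


def make_session(agent=CONTACT_AGENT):
    """Return a Session that sends the given User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": agent})
    return session


def fetch(session, url, delay=1.0):
    """Fetch one page, raise on HTTP errors such as 404, then pause briefly."""
    response = session.get(url, timeout=10)
    response.raise_for_status()
    time.sleep(delay)  # crude rate limit between consecutive requests
    return response.text


session = make_session()
```

Calling `fetch(session, 'https://angel.co/bloom')` would then return the page HTML, or raise `requests.HTTPError` if the site still answers with an error status.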
Upvotes: 3
Reputation: 22440
By passing a headers argument to requests.get you can reach the page. Here are the results for "PEOPLE ALSO VIEWED". Try the script below:
import requests
from bs4 import BeautifulSoup
res = requests.get('https://angel.co/pen-io', headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.select(".text"):
    try:
        title = item.select_one("a.startup-link").get_text()
    except AttributeError:  # select_one returned None: no startup link in this block
        title = ''
    print(title)
Results:
Corilla
Pronoun
checkthis
Wattpad
Medium
Plympton
Cheezburger
AngelList
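Since `select_one` returns `None` when nothing matches, the same lookup can also be written with an explicit check instead of `try`/`except`. The sketch below runs on a small inline HTML sample whose structure is an assumption made for illustration (it is not copied from angel.co), and `extract_titles` is a hypothetical helper name. Unlike the script above, it skips blocks without a link rather than printing an empty string.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the selectors used in the script above.
SAMPLE_HTML = """
<div class="text"><a class="startup-link" href="/corilla">Corilla</a></div>
<div class="text"><span>no link here</span></div>
<div class="text"><a class="startup-link" href="/medium"> Medium </a></div>
"""


def extract_titles(html):
    """Return the text of every a.startup-link found inside a .text element."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for item in soup.select(".text"):
        link = item.select_one("a.startup-link")
        if link is not None:  # skip blocks without a startup link
            titles.append(link.get_text(strip=True))
    return titles


titles = extract_titles(SAMPLE_HTML)
print(titles)  # ['Corilla', 'Medium']
```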
Upvotes: 0