user448518

Reputation: 76

I am trying to web scrape http://angel.co/bloomfire

I am trying to scrape data from the website https://angel.co/bloomfire using this script:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://angel.co/pen-io')
soup = BeautifulSoup(res.content, 'html.parser')
print(soup.prettify())

This prints a page whose title tag is "Page not found - 404 - AngelList". The site works fine in a web browser, but the source shown there is not the same as the output of my Python script. I have also tried Selenium with PhantomJS, with the same result.

Upvotes: 1

Views: 1077

Answers (2)

wkl

Reputation: 79893

It looks like angel.co responds with an HTTP 404 depending on the User-Agent that is sent, and that it blocks the default requests agent (possibly depending on version). This is likely meant to discourage bot activity.
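To see which agent string requests sends when no header is set, you can ask the library directly without making a network call. This is a minimal sketch; the version suffix depends on whichever release is installed.

```python
import requests

# The User-Agent that requests attaches when no header is supplied.
default_agent = requests.utils.default_user_agent()
print(default_agent)  # e.g. python-requests/<installed version>

# A new Session starts with the same default in its headers.
session = requests.Session()
print(session.headers["User-Agent"] == default_agent)
```

Any server that filters on this string sees it on every request that does not override the header.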

Some output from my ipython session follows. I'm using requests/2.17.3.

Using default Python-requests User-Agent

In [37]: rsp = requests.get('https://angel.co/bloom')
In [38]: rsp.status_code
Out[38]: 404

Using a Mozilla-compatible User-Agent

In [39]: rsp = requests.get('https://angel.co/bloom', headers={'User-Agent': 'Mozilla/5.0'})

In [40]: rsp.status_code
Out[40]: 200

rsp.content contains the content you'd expect to see from angel.co/bloom.

Using some random User-Agent

In [41]: rsp = requests.get('https://angel.co/bloom', headers={'User-Agent': 'birryree angel scraper'})

In [42]: rsp.status_code
Out[42]: 200

So you should be setting the User-Agent to get past any kind of filtering/blocking angel is using for various default agents.

If you're going to be doing heavy scraping, I'd suggest you be a good citizen and set an agent string that would let them contact you in case your scraping is causing issues, like:

requests.get('https://angel.co/bloom',
             headers={'User-Agent': 'Mozilla/5.0 (compatible; http://yoursite.com)'})
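For repeated requests, the same header can be set once on a requests.Session instead of being passed on every call. This is a sketch under assumptions: the helper name make_session is made up for illustration, and the contact URL is the same placeholder used above.

```python
import requests


def make_session(contact_url):
    """Return a Session that sends an identifying User-Agent on every request.

    `make_session` is an illustrative helper, not part of requests.
    """
    session = requests.Session()
    session.headers.update(
        {"User-Agent": f"Mozilla/5.0 (compatible; {contact_url})"}
    )
    return session


session = make_session("http://yoursite.com")  # placeholder contact URL

# Build the request without sending it, to inspect the header offline.
prepared = session.prepare_request(
    requests.Request("GET", "https://angel.co/bloom")
)
print(prepared.headers["User-Agent"])
```

Calling session.get(url) would then send this header. Checking rsp.status_code, or calling rsp.raise_for_status(), before parsing catches the 404 case early.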

Upvotes: 3

SIM

Reputation: 22440

Once you add a headers argument to the request, you can reach the page. Here are the results for "PEOPLE ALSO VIEWED". Try the script below:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://angel.co/pen-io', headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.select(".text"):
    try:
        title = item.select_one("a.startup-link").get_text()
    except AttributeError:  # select_one returned None: no matching link
        title = ''
    print(title)

Results:

Corilla
Pronoun
checkthis
Wattpad
Medium
Plympton
Cheezburger
AngelList
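The guard around get_text() is needed because select_one returns None when nothing matches. Here is a small offline sketch of the same loop with an explicit None check. The HTML snippet is made up for illustration and is not AngelList's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two ".text" blocks, only the first has a startup link.
html = """
<div class="text"><a class="startup-link" href="#">Corilla</a></div>
<div class="text"><span>no link here</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
titles = []
for item in soup.select(".text"):
    link = item.select_one("a.startup-link")
    # select_one returns None on no match, so test before calling get_text().
    titles.append(link.get_text(strip=True) if link else "")

print(titles)
```

Testing for None explicitly avoids a broad except clause that could hide unrelated errors.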

Upvotes: 0
