prashantgpt91

Reputation: 1795

unable to scrape


I am trying to get the list of companies from AngelList: https://angel.co/companies

I tried this code:

from bs4 import BeautifulSoup
import urllib2

headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('https://angel.co/companies', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, "html.parser")
p1 = soup.find_all('div' , {"class"," dc59 frw44 _a _jm"})
print p1

But this returns an empty list.

I have gone through similar questions; some say to update BeautifulSoup, others say to change the parser. Nothing is working for me.

Upvotes: 1

Views: 916

Answers (3)

Padraic Cunningham

Reputation: 180522

You can get all the company HTML without Selenium. POST to https://angel.co/company_filters/search_data to get the query params, then pass them to the companies/startups URL:

import requests
from bs4 import BeautifulSoup

# post url: returns the ids, page, total and hexdigest for the search
js = "https://angel.co/company_filters/search_data"

headers = {"X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}

# get url: returns the company rows as html inside a json response
u = "https://angel.co/companies/startups?ids%5B%5D={}&total={}&page={}&sort=signal&new=false&hexdigest={}"
with requests.Session() as s:
    params = s.post(js, data={"sort": "signal"}, headers=headers).json()
    # arguments follow the placeholder order: ids, total, page, hexdigest
    companies = s.get(u.format("&ids%5B%5D=".join(map(str, params["ids"])), params["total"], params["page"], params["hexdigest"]), headers=headers)
    # name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(companies.json()["html"], "html.parser")
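
As a side note, the GET URL can also be assembled with urlencode instead of positional format placeholders, which makes it harder to mix up total and page. This is only a sketch: build_startups_url is a hypothetical helper, the sample dict is made up, and it assumes the endpoint accepts the same query keys as the URL above.

```python
from urllib.parse import urlencode

BASE = "https://angel.co/companies/startups"


def build_startups_url(params):
    """Build the GET URL from the JSON returned by the search_data POST."""
    # one "ids[]" pair per company id; urlencode escapes the brackets as %5B%5D
    query = [("ids[]", company_id) for company_id in params["ids"]]
    query += [
        ("total", params["total"]),
        ("page", params["page"]),
        ("sort", "signal"),
        ("new", "false"),
        ("hexdigest", params["hexdigest"]),
    ]
    return BASE + "?" + urlencode(query)


# made-up values, only to show the shape of the result
example = {"ids": [275696, 275832], "total": 2, "page": 1, "hexdigest": "abc123"}
print(build_startups_url(example))
```

Each value is matched to its key by name here, so reordering the list does not change which parameter gets which value.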

To simulate the "load more" button, pass the page number as you iterate:

import requests
from bs4 import BeautifulSoup
import time

# post url
js = "https://angel.co/company_filters/search_data"

# X-Requested-With is important
headers = {"X-Requested-With": "XMLHttpRequest",
           "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36"}


# get url
u = "https://angel.co/companies/startups?ids%5B%5D={}&total={}&page={}&sort=signal&new=false&hexdigest={}"


def get_next_pages(js, u, start_page=1):
    with requests.Session() as s:
        params = s.post(js, data={"sort": "signal","page":start_page}, headers=headers).json()
        companies = s.get(
            u.format("&ids%5B%5D=".join(map(str, params["ids"])), params["total"], params["page"], params["hexdigest"]),
            headers=headers)
        soup = BeautifulSoup(companies.json()["html"], "html.parser")
        comps = soup.select("div.company.column")
        yield comps
        while True:
            # increment page count from previous.
            page = params["page"] + 1
            params = s.post(js, data={"sort": "signal", "page": page}, headers=headers).json()
            # keep going until we have reached the maximum queries
            if "ids" not in params:
                break
            companies = s.get(u.format("&ids%5B%5D=".join(map(str, params["ids"])), params["total"], params["page"],
                                       params["hexdigest"]),
                              headers=headers)
            soup = BeautifulSoup(companies.json()["html"], "html.parser")
            comps = soup.select("div.company.column")
            # don't hammer with requests
            time.sleep(.3)
            yield comps

for comps in get_next_pages(js, u):
    print(comps)
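
The stopping rule used in the loop (keep requesting pages until the response no longer contains "ids") can be looked at in isolation from the HTTP calls. The sketch below is an illustration only: iter_pages, fetch_page and fake_fetch are hypothetical names, and fake_fetch stands in for the real POST so the loop can be run offline.

```python
import time


def iter_pages(fetch_page, start_page=1, delay=0.0):
    """Yield each page's params until a response has no "ids" key."""
    page = start_page
    while True:
        params = fetch_page(page)
        # the site stops returning ids once the query limit is reached
        if "ids" not in params:
            break
        yield params
        # increment the page count from the previous response
        page = params["page"] + 1
        # pause between requests so the server is not hammered
        time.sleep(delay)


def fake_fetch(page):
    """Stand-in for the POST request: three pages of ids, then none."""
    if page > 3:
        return {"page": page}
    return {"ids": [page * 10, page * 10 + 1], "page": page, "total": 6}


pages = list(iter_pages(fake_fetch))
print([p["page"] for p in pages])
```

With a real session, fetch_page would wrap the s.post call and return its .json() result.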

If we watch the network output in developer tools while hitting load more, we can see the POST data for each request. It keeps going until we hit our limit.

A snippet of the output from running the get_next_pages loop above:

[<div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="275696" data-type="Startup" href="https://angel.co/dunwello?utm_source=companies" title="Dunwello"><img alt="Dunwello" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/275696-99335faecd2fb01467c98d5032f23cf6-thumb_jpg.jpg?buster=1393099676"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="275696" data-type="Startup" href="https://angel.co/dunwello?utm_source=companies">Dunwello</a>
</div>
<div class="pitch">
Trustworthy recommendations of individual professionals.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="275832" data-type="Startup" href="https://angel.co/groupahead?utm_source=companies" title="GroupAhead"><img alt="GroupAhead" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/275832-3541a563250008bd3f7f9b4d7fe9c33c-thumb_jpg.jpg?buster=1423077576"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="275832" data-type="Startup" href="https://angel.co/groupahead?utm_source=companies">GroupAhead</a>
</div>
<div class="pitch">
Dedicated apps for groups
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="431492" data-type="Startup" href="https://angel.co/workpop?utm_source=companies" title="Workpop"><img alt="Workpop" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/431492-c1b857e30254da60f3847d5358db5c82-thumb_jpg.jpg?buster=1404420060"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="431492" data-type="Startup" href="https://angel.co/workpop?utm_source=companies">Workpop</a>
</div>
<div class="pitch">
When can you start?
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="446358" data-type="Startup" href="https://angel.co/late-stage-pre-ipo-syndicate?utm_source=companies" title="Late Stage Pre-IPO @ Flight.vc"><img alt="Late Stage Pre-IPO @ Flight.vc" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/446358-3511ab7edb5192dad97cbccf2b67ddd7-thumb_jpg.jpg?buster=1428089778"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="446358" data-type="Startup" href="https://angel.co/late-stage-pre-ipo-syndicate?utm_source=companies">Late Stage Pre-IPO @ Flight.vc</a>
</div>
<div class="pitch">
Syndicated:  Beepi, Zirx, Boost Media, Rent the Runway, Life 360, Scripted
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="450451" data-type="Startup" href="https://angel.co/complex-polygon?utm_source=companies" title="Complex Polygon"><img alt="Complex Polygon" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/450451-4f00fd11b2d54533a5bac3cfa72acb1e-thumb_jpg.jpg?buster=1407937645"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="450451" data-type="Startup" href="https://angel.co/complex-polygon?utm_source=companies">Complex Polygon</a>
</div>
<div class="pitch">
Product studio based in San Francisco, California. 
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="457068" data-type="Startup" href="https://angel.co/21?utm_source=companies" title="21"><img alt="21" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/457068-2e7b8c417c3a70aab3026f5f0ca3d8e9-thumb_jpg.jpg?buster=1425975133"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="457068" data-type="Startup" href="https://angel.co/21?utm_source=companies">21</a>
</div>
<div class="pitch">
A bitcoin miner in every device and in every hand.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="460720" data-type="Startup" href="https://angel.co/parenthoods?utm_source=companies" title="Parenthoods"><img alt="Parenthoods" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/460720-25bc7ca7afd4f7bf0fd7842cafa1bdd1-thumb_jpg.jpg?buster=1425426951"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="460720" data-type="Startup" href="https://angel.co/parenthoods?utm_source=companies">Parenthoods</a>
</div>
<div class="pitch">
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="462906" data-type="Startup" href="https://angel.co/seed-8?utm_source=companies" title="Seed"><img alt="Seed" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/462906-f6b439e20a9d36b9e2d3792da92d160d-thumb_jpg.jpg?buster=1462318689"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="462906" data-type="Startup" href="https://angel.co/seed-8?utm_source=companies">Seed</a>
</div>
<div class="pitch">
Online Business Banking
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="470102" data-type="Startup" href="https://angel.co/zen99?utm_source=companies" title="Zen99"><img alt="Zen99" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/470102-67da791cec4374a1046c53fe99b6f05f-thumb_jpg.jpg?buster=1410560341"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="470102" data-type="Startup" href="https://angel.co/zen99?utm_source=companies">Zen99</a>
</div>
<div class="pitch">
Finance and insurance tools for freelancers
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="488240" data-type="Startup" href="https://angel.co/maven-ventures-growth-labs?utm_source=companies" title="Maven Ventures Growth Labs"><img alt="Maven Ventures Growth Labs" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/488240-d467860829cac8b1a9fbfa2d14e05789-thumb_jpg.jpg?buster=1411577330"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="488240" data-type="Startup" href="https://angel.co/maven-ventures-growth-labs?utm_source=companies">Maven Ventures Growth Labs</a>
</div>
<div class="pitch">
Get a option to invest up to $500k in the best Maven grads
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="507975" data-type="Startup" href="https://angel.co/skydio?utm_source=companies" title="Skydio"><img alt="Skydio" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/507975-aac9786d6c4cba99be634b7bc1969cf3-thumb_jpg.jpg?buster=1420952326"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="507975" data-type="Startup" href="https://angel.co/skydio?utm_source=companies">Skydio</a>
</div>
<div class="pitch">
MIT, Google[x]ers with deep prior experience doing intelligent navigation for drones
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="517240" data-type="Startup" href="https://angel.co/fin-tech-syndicate?utm_source=companies" title="Fin Tech by Flight.vc"><img alt="Fin Tech by Flight.vc" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/517240-5bc50eb42d1e40a8ad437c6bd164a5a8-thumb_jpg.jpg?buster=1414004533"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="517240" data-type="Startup" href="https://angel.co/fin-tech-syndicate?utm_source=companies">Fin Tech by Flight.vc</a>
</div>
<div class="pitch">
Investing in Financial Services and Fin-Tech that has proprietary advantages
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="521452" data-type="Startup" href="https://angel.co/channel-app?utm_source=companies" title="Channel"><img alt="Channel" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/521452-b6bc15ef040fdf37d885aea71ecad3bb-thumb_jpg.jpg?buster=1446676191"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="521452" data-type="Startup" href="https://angel.co/channel-app?utm_source=companies">Channel</a>
</div>
<div class="pitch">
Watch the world.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="443932" data-type="Startup" href="https://angel.co/healthsherpa?utm_source=companies" title="HealthSherpa"><img alt="HealthSherpa" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/443932-63c6bcbbf9ba36a7fa3e532177222c9b-thumb_jpg.jpg?buster=1462374897"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="443932" data-type="Startup" href="https://angel.co/healthsherpa?utm_source=companies">HealthSherpa</a>
</div>
<div class="pitch">
Next-generation Healthcare.gov
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="558206" data-type="Startup" href="https://angel.co/sidewire?utm_source=companies" title="Sidewire"><img alt="Sidewire" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/558206-b416bf8347c7f766b5ea1cf79123c4d2-thumb_jpg.jpg?buster=1444189112"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="558206" data-type="Startup" href="https://angel.co/sidewire?utm_source=companies">Sidewire</a>
</div>
<div class="pitch">
Where Experts Chat in Public
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="570055" data-type="Startup" href="https://angel.co/brainchild-1?utm_source=companies" title="Brainchild &amp;amp; Co."><img alt="Brainchild &amp; Co." class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/570055-cc2c2309fefa21e3ebda6229d6a0b890-thumb_jpg.jpg?buster=1420474118"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="570055" data-type="Startup" href="https://angel.co/brainchild-1?utm_source=companies">Brainchild &amp; Co.</a>
</div>
<div class="pitch">
Building services and products for consumers
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="571060" data-type="Startup" href="https://angel.co/signatures-capital?utm_source=companies" title="Signatures Capital"><img alt="Signatures Capital" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/571060-8a077d7cbac9cc7e2d81859adb8cd1c6-thumb_jpg.jpg?buster=1420664121"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="571060" data-type="Startup" href="https://angel.co/signatures-capital?utm_source=companies">Signatures Capital</a>
</div>
<div class="pitch">
Supporting founders committed to inventing the future.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="623000" data-type="Startup" href="https://angel.co/airtable?utm_source=companies" title="Airtable"><img alt="Airtable" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/623000-9d210a39051abc7accec1dc686888dcc-thumb_jpg.jpg?buster=1449952044"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="623000" data-type="Startup" href="https://angel.co/airtable?utm_source=companies">Airtable</a>
</div>
<div class="pitch">
Organize anything you can imagine
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="630861" data-type="Startup" href="https://angel.co/meerkat?utm_source=companies" title="Meerkat"><img alt="Meerkat" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/630861-820b9d4af09e110b150c9affe418d860-thumb_jpg.jpg?buster=1425688408"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="630861" data-type="Startup" href="https://angel.co/meerkat?utm_source=companies">Meerkat</a>
</div>
<div class="pitch">
Live Stream Video.
</div>
</div>
</div>
</div>, <div class="company column">
<div class="g-lockup">
<div class="photo">
<a class="startup-link" data-id="658877" data-type="Startup" href="https://angel.co/flight-vc-syndicate?utm_source=companies" title="Flight Ventures"><img alt="Flight Ventures" class="angel_image" src="https://d1qb2nb5cznatu.cloudfront.net/startups/i/658877-89ccd88502db9d964a651ecba6f86d9d-thumb_jpg.jpg?buster=1457552637"/></a>
</div>
<div class="text">
<div class="name">
<a class="startup-link" data-id="658877" data-type="Startup" href="https://angel.co/flight-vc-syndicate?utm_source=companies">Flight Ventures</a>
</div>
<div class="pitch">
Investing in the Top Companies and Entrepreneurs
</div>
</div>
</div>
</div>]
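
Each div.company.column block in the output above carries the company name, link and pitch. A minimal sketch of pulling those fields out follows. It uses only the standard library's html.parser so it runs without extra installs (the same fields could be read from the soup objects instead); CompanyParser, parse_companies and SAMPLE are hypothetical names, and SAMPLE is a trimmed copy of the first block of the output.

```python
from html.parser import HTMLParser


class CompanyParser(HTMLParser):
    """Collect name, url and pitch from 'div.company.column' blocks."""

    def __init__(self):
        super().__init__()
        self.companies = []
        self._field = None  # "name" or "pitch" while inside that div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "company" in classes and "column" in classes:
            # a new company block starts here
            self.companies.append({"name": "", "url": "", "pitch": ""})
        elif tag == "div" and classes == ["name"]:
            self._field = "name"
        elif tag == "div" and classes == ["pitch"]:
            self._field = "pitch"
        elif tag == "a" and self._field == "name" and self.companies:
            self.companies[-1]["url"] = attrs.get("href", "")

    def handle_endtag(self, tag):
        # the name and pitch divs contain no nested divs, so any </div> ends them
        if tag == "div":
            self._field = None

    def handle_data(self, data):
        if self._field and self.companies:
            self.companies[-1][self._field] += data.strip()


def parse_companies(html):
    parser = CompanyParser()
    parser.feed(html)
    parser.close()
    return parser.companies


SAMPLE = """<div class="company column">
<div class="g-lockup">
<div class="text">
<div class="name">
<a class="startup-link" data-id="275696" data-type="Startup" href="https://angel.co/dunwello?utm_source=companies">Dunwello</a>
</div>
<div class="pitch">
Trustworthy recommendations of individual professionals.
</div>
</div>
</div>
</div>"""

for company in parse_companies(SAMPLE):
    print(company["name"], company["url"], company["pitch"], sep=" | ")
```

A company with an empty pitch div, as in one of the blocks above, simply gets an empty pitch string.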

There are more filters you can set. To see how, select them in the browser and watch how the requests are made under the XHR tab of the Network panel in Firebug or developer tools.

Upvotes: 7

kreddyio

Reputation: 155

In your case, it seems that all div elements with the class frw44 are generated dynamically with JavaScript. You cannot get dynamically generated data from the page with the traditional urllib, urllib2 or requests modules (or even mechanize, for that matter), because none of them execute JavaScript. You'll have to simulate a browser environment, using Selenium with Chrome, Firefox or PhantomJS, to evaluate the JavaScript in the webpage.

Have a look at the Selenium bindings for Python.

The following has been tested and verified by me:

from bs4 import BeautifulSoup as bs
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://angel.co/companies")
html = driver.page_source
driver.quit()
soup = bs(html,"html.parser")
p1 = soup.findAll('div' , {"class":" dc59 frw44 _a _jm"})
print p1

Upvotes: 0

Dušan Maďar

Reputation: 9909

The data you want to extract is generated by JavaScript. That is why p1 is an empty list: urllib2.urlopen(req).read() gives you the raw server response and does not wait for any JavaScript to run.

Use BeautifulSoup in combination with Selenium:

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://angel.co/companies')
html = browser.page_source

soup = BeautifulSoup(html, "html.parser")
p1 = soup.find_all('div', {"class": "dc59 frw44 _a _jm"})
print p1

If this doesn't work (it is not tested), simplify the class selector: try searching for dc59 only, then make it gradually more specific.

Upvotes: 2
