user3032140

Reputation: 15

Finding specific frame in URL to scrape data using Python BeautifulSoup

I am a beginner at HTML and web scraping, and I am trying to get the data shown below using Python and BeautifulSoup.

[
Theft      06/24/15 08:47 PM  2000 BLOCK OF S COLLEGE AV
Vandalism  06/24/15 07:32 PM  3600 BLOCK OF WELLBORN RD
Theft      06/24/15 07:30 PM  800 BLOCK OF RIO GRANDE LN
Theft      06/24/15 06:40 PM  1800 BLOCK OF FINFEATHER RD
]

But when I parse the site http://spotcrime.com/#77801, the div that holds this data is missing from the parsed HTML, so I cannot extract it.

The code that I am using is:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://spotcrime.com/#77801')

soup = BeautifulSoup(html.read(), 'html.parser')
print soup

Upvotes: 1

Views: 1457

Answers (2)

alecxe

Reputation: 474171

urlopen does not receive the main crimes container. All it gets is this placeholder:

<div id="table_container" class="list-group crime-list" style="margin-top: -30px;">
  <h3>Loading Crime Data...</h3>
  <p>City and county crime map showing crime incident data down to neighborhood crime</p>
</div>

This is because the main container is built in the browser: JavaScript makes an additional API call to the http://api.spotcrime.com/crimes.json endpoint and renders the result into the page.
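To check from code whether you only received the placeholder, something along these lines might help. It is a minimal sketch in Python 3 using only the standard library; the `ContainerText` class and `container_is_placeholder` function are hypothetical names made up for illustration, and the check assumes the placeholder text stays "Loading Crime Data".

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ContainerText(HTMLParser):
    """Collect the text found inside the element with id="table_container"."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside the container; 0 means outside
        self.chunks = []  # non-empty text fragments found inside it

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == "table_container":
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text:
            self.chunks.append(text)


def container_is_placeholder(html):
    """Return True if the container still holds the 'Loading' message."""
    parser = ContainerText()
    parser.feed(html)
    return any("Loading Crime Data" in chunk for chunk in parser.chunks)


# The markup that urlopen received, copied from above.
static_html = """
<div id="table_container" class="list-group crime-list" style="margin-top: -30px;">
  <h3>Loading Crime Data...</h3>
  <p>City and county crime map showing crime incident data down to neighborhood crime</p>
</div>
"""
print(container_is_placeholder(static_html))
```

If this reports a placeholder, no amount of HTML parsing will find the rows, because they were never part of the downloaded document.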

What you can do is simulate that API call in your code with requests. Working example:

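import requests

url = "http://spotcrime.com/#77801"
crimes_url = "http://api.spotcrime.com/crimes.json"

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.130 Safari/537.36'}
with requests.Session() as session:
    # Send a browser-like User-Agent with every request in this session.
    session.headers.update(headers)

    # Visit the main page first, as a browser would, so the session keeps any cookies.
    session.get(url)

    # Query-string parameters of the API call the page makes.
    params = {
        "lat": "30.6423514",
        "lon": "-96.3704778",
        "radius": "0.02",
        "key": "spotcrime-private-api-key",
        "_": "1435453242689"
    }
    # For a GET request the parameters go in the query string (params=), not the body (data=).
    response = session.get(crimes_url, params=params)
    crimes = response.json()["crimes"]
    for item in crimes:
        print(item)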

It prints dictionaries corresponding to each row in the crime table:

{u'cdid': 64482204, u'lon': -96.3661035, u'lat': 30.6507387, u'link': u'http://spotcrime.com/crime/64482204-6737a0085bd9aff31548993910efa35a', u'address': u'2000 BLOCK OF S COLLEGE AV', u'date': u'06/24/15 08:47 PM', u'type': u'Theft'}
{u'cdid': 64482189, u'lon': -96.3594859, u'lat': 30.6299681, u'link': u'http://spotcrime.com/crime/64482189-345f4eca1c977f43e97ea4981f73d4de', u'address': u'3600 BLOCK OF WELLBORN RD', u'date': u'06/24/15 07:32 PM', u'type': u'Vandalism'}
...
{u'cdid': 64370976, u'lon': -96.361556, u'lat': 30.631685, u'link': u'http://spotcrime.com/crime/64370976-dc6e6dbb29fc7376c2b82356c45d281d', u'address': u'3600 BLOCK OF WELLBORN RD #802', u'date': u'06/18/15 12:37 PM', u'type': u'Arrest'}
{u'cdid': 64371003, u'lon': -96.3539954, u'lat': 30.6434707, u'link': u'http://spotcrime.com/crime/64371003-d9934d9b9d83c1867871701874c45523', u'address': u'2900 BLOCK OF S TEXAS AVENUE', u'date': u'06/18/15 09:56 AM', u'type': u'Vandalism'}
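To turn those dictionaries into the type / date / address rows from the question, a small helper could look like the sketch below. It is written for Python 3, `to_row` is a hypothetical name, and the date format string is an assumption inferred from the sample output above. The two records are copied from that output; in a real run you would iterate over `response["crimes"]` instead.

```python
from datetime import datetime

# Two records copied from the output above (unused keys trimmed for brevity).
crimes = [
    {"cdid": 64482204, "address": "2000 BLOCK OF S COLLEGE AV",
     "date": "06/24/15 08:47 PM", "type": "Theft"},
    {"cdid": 64482189, "address": "3600 BLOCK OF WELLBORN RD",
     "date": "06/24/15 07:32 PM", "type": "Vandalism"},
]


def to_row(crime):
    """Pick the three displayed fields and parse the date string."""
    # Assumed format: MM/DD/YY hh:mm AM/PM, as seen in the sample output.
    when = datetime.strptime(crime["date"], "%m/%d/%y %I:%M %p")
    return crime["type"], when, crime["address"]


rows = [to_row(crime) for crime in crimes]
for kind, when, address in rows:
    print("{} | {} | {}".format(kind, when.isoformat(), address))
```

Parsing the date into a `datetime` is optional, but it makes sorting or filtering by time straightforward.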

Upvotes: 0

Davey Struijk

Reputation: 61

You can't find the div because it's dynamically loaded and inserted by JavaScript. What you can do in this case, however, is replicate the AJAX request that fetches all this crime data.

It seems their internal API doesn't require any sort of authentication, so you can just go ahead and send the following API request: GET api.spotcrime.com/crimes.json?lat=30.639155&lon=-96.3647937&radius=0.02&key=spotcrime-private-api-key
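If you want to build that request URL for other coordinates, a sketch like the following might do. It uses only the Python 3 standard library; `build_crimes_url` is a hypothetical helper name, the `http://` scheme is an assumption, and the endpoint and key are simply the values quoted in this thread, so they may stop working at any time.

```python
from urllib.parse import urlencode

# Endpoint as quoted in this thread; the http:// scheme is an assumption.
API_URL = "http://api.spotcrime.com/crimes.json"


def build_crimes_url(lat, lon, radius="0.02", key="spotcrime-private-api-key"):
    """Return the GET URL for crimes around the given coordinates."""
    query = urlencode({"lat": lat, "lon": lon, "radius": radius, "key": key})
    return API_URL + "?" + query


print(build_crimes_url("30.639155", "-96.3647937"))
```

The resulting string can be passed to any HTTP client. The response body is JSON, so it can be decoded directly without an HTML parser.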

As a bonus, you don't need to scrape the site at all, since everything is neatly returned as JSON objects.

Upvotes: 1
