judebox

Reputation: 59

Web scraping python not returning any content

I am trying to web scrape from "https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-Present/y8tr-7khq". Specifically, under the div class = "socrata-table frozen-columns", I want all of the data-column names and data-column descriptions. However, the code I've written doesn't seem to be working (it's not returning anything):

import requests
from bs4 import BeautifulSoup

url = "https://data.lacity.org/A-Safe-City/Crime-Data-from-2010-to-Present/y8tr-7khq"
page = requests.get(url)
print(page.status_code)
soup = BeautifulSoup(page.content, 'html.parser')

for col in soup.find_all("div", attrs={"class": "socrata-visualization-container loaded"})[0:1]:
    for tr in col.find_all("div", attrs={"class": "socrata-table frozen-columns"}):
        for data in tr.find_all("div", attrs={"class": "column-header-content"}):
            print(data.text)

is my code wrong?

Upvotes: 2

Views: 4601

Answers (3)

QHarr

Reputation: 84465

The page is loaded dynamically and the data set is paged, so retrieving it with browser automation would be slow. There is an API you can use instead. It has parameters that allow you to return results in batches.

Read the API documentation here. This is going to be a much more efficient and reliable way of retrieving the data.

Use the $limit parameter to set the number of records retrieved per request, and the $offset parameter to start the next batch at the next set of records. Example call here.

Since it is a query, you can tailor the other parameters as you would in a SQL query to retrieve the desired result set. This also means you can write a quick initial query to return the record count from the database, which you can use to determine the end point for your batch requests.
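As a sketch of that paging idea (the $limit/$offset parameter names follow the SODA convention mentioned above, but treat the helper below as illustrative, not an official client):

```python
# Illustrative helper: given the total record count (e.g. from an initial
# count query) and a batch size, build the $limit/$offset URLs needed to
# page through the whole dataset.
def batch_urls(base_url, total, limit):
    return [f"{base_url}?$limit={limit}&$offset={offset}"
            for offset in range(0, total, limit)]

urls = batch_urls("https://data.lacity.org/api/id/y8tr-7khq.json", 250, 100)
print(urls[0])    # first batch starts at offset 0
print(len(urls))  # 3 batches cover 250 records at 100 per request
```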

You could write a class-based script that uses multiprocessing to grab these batches more efficiently.

import requests
import pandas as pd

response = requests.get('https://data.lacity.org/api/id/y8tr-7khq.json?$select=`dr_no`,`date_rptd`,`date_occ`,`time_occ`,`area_id`,`area_name`,`rpt_dist_no`,`crm_cd`,`crm_cd_desc`,`mocodes`,`vict_age`,`vict_sex`,`vict_descent`,`premis_cd`,`premis_desc`,`weapon_used_cd`,`weapon_desc`,`status`,`status_desc`,`crm_cd_1`,`crm_cd_2`,`crm_cd_3`,`crm_cd_4`,`location`,`cross_street`,`location_1`&$order=`date_occ`+DESC&$limit=100&$offset=0')
data = response.json()
# pd.json_normalize flattens the list of JSON records straight into a DataFrame
# (pandas.io.json.json_normalize is deprecated in newer pandas versions)
df = pd.json_normalize(data)
print(df)

Example record in JSON response: (screenshot omitted)
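The concurrent-batch idea mentioned above could be sketched like this. Note this is a minimal sketch using a thread pool: fetch_batch is a hypothetical stand-in for the real network call, which would issue requests.get against the API endpoint with the corresponding $limit/$offset values.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real network call; a real fetch_batch would
# return requests.get(API_URL, params={"$limit": limit, "$offset": offset}).json()
def fetch_batch(offset, limit=100):
    return {"offset": offset, "limit": limit}

# Fire off all batch requests concurrently; pool.map preserves input order,
# so results line up with the offsets we asked for.
offsets = range(0, 500, 100)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch_batch, offsets))

print([r["offset"] for r in results])  # [0, 100, 200, 300, 400]
```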

Upvotes: 2

Vishnudev Krishnadas

Reputation: 10970

This is because the data is filled in dynamically by ReactJS after the page loads.

If you download the page with requests, you can't see the data.

You need to use a Selenium WebDriver to open the page and process all the JavaScript. Then you can get the data you expect.

Upvotes: 0

Adrian

Reputation: 105

If you look at the page source (Ctrl + U), you'll notice that there is no such element as <div class = "socrata-table frozen-columns">. That's because the content you want to scrape is added to the page dynamically. Check out these questions: web scraping dynamic content with python or Web scraping a website with dynamic javascript content

Upvotes: 0
