Reputation: 79
I'm still new to Python. I've searched a number of similar questions on here, but none of them help with this one particular website. The code below pulls the page, but nothing I've tried loads the specific section (class head-loc) that holds the name, phone number, etc. I've tried Selenium and WebDriverWait, among other methods, with no luck.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.qualitycheck.org/quality-report/?service=Behavioral%20Health%2CChemical%20Dependency&ajax=1&json=1&callback=jQuery110205938799402161818_1589391496145&_=1589391496146&bsnid=21'
response = requests.get(URL)
soup = BeautifulSoup(response.text, 'lxml')
directory = soup.find_all('div', class_="head-loc")
# results = soup.find(id='head-loc')
print(soup.prettify())
Eventually, once I figure out how to wait until this content loads and grab it, I will need to write something that iterates over all the URLs based on the ID at the end. But for now I need to figure out how to get this content in the first place.
Any help is appreciated.
Upvotes: 1
Views: 1651
Reputation: 39397
This page uses JavaScript to load its data.
BeautifulSoup does not execute JavaScript; all you get is the raw HTML page, similar to what you see when you select "View Source" in your browser.
You can either use WebDriver to drive an actual browser that runs the JavaScript and wait for specific elements to appear, or you can look at the source code of that page and find the various JavaScript sections that look like this:
/* <![CDATA[ */
$(document).ready(function () {
if (BaseModule.getQueryVariable("print") == 'y') {
$("#divDemographicInfoLoading").hide();
$("#divDemographicInfo").hide();
return;
}
var data = new FormData();
data.append("f", "GetDemographicInfo");
data.append("bsnId", "21");
$.ajax({
type: "POST",
url: "/ajax/QualityReport/ajax.aspx",
contentType: false,
processData: false,
data: data,
dataType: 'json',
success: function (o) {
if (o.ResponseHtml == '') {
$(".NoDemographicInfo").html('<h4>Quality Report not available</h4>');
$("#divDemographicInfoLoading").hide();
$(".QualityReportControl").hide();
} else {
$("#divDemographicInfoLoading").hide();
$("#divDemographicInfo").html(o.ResponseHtml);
}
}
});
});
/* ]]> */
This runs an AJAX query to get the demographic information. There are similar queries for other parts of the page, such as the advanced certification programs.
Those queries are simple POSTs with form parameters, so you may be able to reproduce the call. The following curl command is what I get when simplifying the request from inspecting the requests made by the page:
curl 'https://www.qualitycheck.org/ajax/QualityReport/ajax.aspx' \
  --data-binary $'------WebKitFormBoundaryTdDWzPTC3arprVRS\r\nContent-Disposition: form-data; name="f"\r\n\r\nGetDemographicInfo\r\n------WebKitFormBoundaryTdDWzPTC3arprVRS\r\nContent-Disposition: form-data; name="bsnId"\r\n\r\n21\r\n------WebKitFormBoundaryTdDWzPTC3arprVRS--\r\n' \
  -H 'Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryTdDWzPTC3arprVRS'
It looks a bit scary, but it's not that bad: the data-binary part is just the boundary (----WebKitFormBoundaryTdDWzPTC3arprVRS) separating a couple of multipart form elements like this one:
Content-Disposition: form-data; name="f"
GetDemographicInfo
f is the name of the piece of information you're interested in; bsnId appears to be the ID of that particular health service, in this case 21.
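For reference, here is a minimal Python sketch of the same request, using only the standard library so that the multipart layout stays visible. The helper names build_multipart and fetch_demographic_info are hypothetical, the response is assumed to be UTF-8 JSON, and whether the endpoint still accepts this request without extra headers or cookies is untested.

```python
import json
import urllib.request
import uuid

AJAX_URL = "https://www.qualitycheck.org/ajax/QualityReport/ajax.aspx"


def build_multipart(fields, boundary=None):
    """Encode a dict of simple text fields as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        # Each part: boundary line, headers, blank line, value.
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n'
            "\r\n"
            f"{value}\r\n"
        )
    # The closing boundary carries two extra trailing dashes.
    parts.append(f"--{boundary}--\r\n")
    body = "".join(parts).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def fetch_demographic_info(bsn_id):
    """POST the form and return the decoded JSON (hypothetical helper)."""
    body, content_type = build_multipart(
        {"f": "GetDemographicInfo", "bsnId": str(bsn_id)}
    )
    request = urllib.request.Request(
        AJAX_URL,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        # Assumption: the server replies with UTF-8 encoded JSON.
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_demographic_info(21) should send the equivalent of the curl command. If you would rather stay with requests, passing files={"f": (None, "GetDemographicInfo"), "bsnId": (None, "21")} to requests.post makes it build the multipart body for you.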
The result is:
{
"Success":true,
"ResponseHtml":
"<div class=\"mod no-bottom-padding\">\r\n<div class=\"qr-head-logo\">\r\n<div><img src='/assets/1/6/content-icon-gold-seal.png' alt='Gold Seal' /></div>\r\n</div>\r\n<div class=\"qr-head-mod\">\r\n<div class=\"head-name\">\r\nNortheast Ohio Neighborhood Health Services, Inc. \r\n</div>\r\n<div class=\"head-loc\">\r\nHCO ID: 21<br>\r\n 4800 Payne Avenue <br>\r\n Cleveland, OH, 44103 <br>\r\n (216) 231-7700<br>\r\n<a target='_blank' href='http://www.neonhealth.org'>www.neonhealth.org</a>\r\n </div>\r\n </div>\r\n</div>\r\n"
}
which, once unescaped and formatted nicely, gives this HTML:
<div class="mod no-bottom-padding">
  <div class="qr-head-logo">
    <div><img src='/assets/1/6/content-icon-gold-seal.png' alt='Gold Seal' /></div>
  </div>
  <div class="qr-head-mod">
    <div class="head-name">
      Northeast Ohio Neighborhood Health Services, Inc.
    </div>
    <div class="head-loc">
      HCO ID: 21<br>
      4800 Payne Avenue <br>
      Cleveland, OH, 44103 <br>
      (216) 231-7700<br>
      <a target='_blank' href='http://www.neonhealth.org'>www.neonhealth.org</a>
    </div>
  </div>
</div>
That should get you what you were looking for.
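As a rough sketch of the last step, the ResponseHtml string can be pulled out of the JSON and the text of the head-loc div collected from it. This version uses only the standard library; the class DivTextExtractor and the function extract_div_lines are hypothetical names, and the sample payload is a shortened copy of the response shown above. BeautifulSoup would work just as well on the same string.

```python
import json
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect non-empty text lines found inside <div>s with a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0  # > 0 while we are inside a matching div
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            # Nested div inside the target: track it so we know when we leave.
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text:
            self.lines.append(text)


def extract_div_lines(payload_text, css_class="head-loc"):
    """Return the text lines of the target div from the AJAX JSON payload."""
    payload = json.loads(payload_text)
    html = payload.get("ResponseHtml") or ""
    parser = DivTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    # Shortened sample based on the response shown above.
    sample = json.dumps({
        "Success": True,
        "ResponseHtml": "<div class=\"head-loc\">\r\nHCO ID: 21<br>\r\n"
                        " (216) 231-7700<br>\r\n </div>",
    })
    print(extract_div_lines(sample))
```

An empty ResponseHtml (the "Quality Report not available" case in the page's script) simply yields an empty list, which is convenient when looping over many IDs.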
Upvotes: 2
Reputation: 152
import requests
import time
from bs4 import BeautifulSoup
URL = 'https://www.qualitycheck.org/quality-report/?service=Behavioral%20Health%2CChemical%20Dependency&ajax=1&json=1&callback=jQuery110205938799402161818_1589391496145&_=1589391496146&bsnid=21'
response = requests.get(URL)
amount = 5  # number of seconds to wait (example value)
time.sleep(amount)
soup = BeautifulSoup(response.text, 'lxml')
directory = soup.find_all('div', class_="head-loc")
# results = soup.find(id='head-loc')
print(soup.prettify())
Change amount to how long you want your program to wait (in seconds).
Upvotes: 0