Reputation: 121
On the right-hand side of this website there are several tabs, each listing documents to view.
The underlying markup is an a tag with a partial href pointing to the document location. I've been trying to grab all of these documents (their hrefs usually start with '/documents/') but have had no success.
When I scrape, I only ever seem to get the first set of documents, the ones in a tab with a 'Hearing Document' table. Here is the code I used to try to grab every href on the page:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.jud11.flcourts.org/Judge-Details?judgeid=1063&sectionid=2')
soup = BeautifulSoup(page.content, 'html.parser')
# Print the href of every anchor tag that has one
for link in soup.find_all("a"):
    if link.has_attr('href'):
        print(link['href'])
The output contains only the documents from the first tab (in this instance). Here is a snippet:
#collapse1
#collapse2
/documents/judges_forms/1062458802-Ex%20Parte%20Motions%20to%20Compel%20Discovery.pdf
/documents/judges_forms/1062459053-JointCaseMgtReport121.pdf
#collapse4
#collapse6
Does anyone know how to get the following links, which do exist on the same page (listed below)? (I would suggest confirming this with your browser's Inspect Element feature, but it will not show them on the initial view. You first have to open the tab whose table lists 'Hearing Documents' and then use Inspect Element.)
/documents/judges_forms/1422459010-Order%20Granting%20Motion%20to%20Withdraw.docx
/documents/judges_forms/1422459046-ORDER%20ON%20Attorneys%20Fees.docx
Thanks for any help!
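For the narrower task of separating document links from in-page anchors such as '#collapse1' in a listing like the one above, here is a minimal sketch. It does not load the other tabs; it only filters and normalizes hrefs that are already in hand. The helper names `document_urls` and `readable_name` are hypothetical, and the '/documents/' prefix test is an assumption based on the remark that document links usually start that way. It uses only the standard library so that it runs offline.

```python
from posixpath import basename
from urllib.parse import unquote, urljoin, urlsplit

PAGE_URL = "https://www.jud11.flcourts.org/Judge-Details?judgeid=1063&sectionid=2"

# A few hrefs copied from the listings above.
hrefs = [
    "#collapse4",
    "/documents/judges_forms/1422459010-Order%20Granting%20Motion%20to%20Withdraw.docx",
    "/documents/judges_forms/1512459051-Evidence%20Procedures.docx",
]


def document_urls(page_url, hrefs):
    """Return absolute URLs for the hrefs that point into /documents/."""
    return [urljoin(page_url, h) for h in hrefs if h.startswith("/documents/")]


def readable_name(url):
    """Return the file name from a URL with percent-escapes decoded."""
    return unquote(basename(urlsplit(url).path))


urls = document_urls(PAGE_URL, hrefs)
for u in urls:
    print(readable_name(u), "->", u)
```

`urljoin` resolves each path against the page URL, so the host does not need to be concatenated by hand.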
Upvotes: 1
Views: 359
Reputation: 195543
You can use this example to get links to documents from other tabs:
import requests
from bs4 import BeautifulSoup
url = 'https://www.jud11.flcourts.org/Judge-Details?judgeid=1063&sectionid=2'
# Headers that mark the request as an ASP.NET AJAX partial postback
headers = {'X-MicrosoftAjax': 'Delta=true',
           'X-Requested-With': 'XMLHttpRequest'}
with requests.session() as s:
    soup = BeautifulSoup(s.get(url).content, 'html.parser')
    # Copy every named form input (including the hidden state fields) into the POST data
    data = {}
    for i in soup.select('input[name]'):
        data[i['name']] = i.get('value', '')
    # Request each tab in turn by posting its index back to the page
    for page in range(0, 6):
        print('Tab no.{}..'.format(page))
        data['ScriptManager'] = "ScriptManager|dnn$ctr1843$View$rtSectionHearingTypes"
        data['__EVENTARGUMENT'] = '{"type":0,"index":"' + str(page) + '"}'
        data['__EVENTTARGET'] = "dnn$ctr1843$View$rtSectionHearingTypes"
        data['dnn_ctr1843_View_rtSectionHearingTypes_ClientState'] = '{"selectedIndexes":["' + str(page) + '"],"logEntries":[],"scrollState":{}}'
        data['__ASYNCPOST'] = "true"
        data['RadAJAXControlID'] = "dnn_ctr1843_View_RadAjaxManager1"
        soup = BeautifulSoup(s.post(url, headers=headers, data=data).content, 'html.parser')
        # Print the absolute URL of every document link in the returned tab content
        for a in soup.select('a[href*="documents"]'):
            print('https://www.jud11.flcourts.org' + a['href'])
Prints:
Tab no.0..
https://www.jud11.flcourts.org/documents/judges_forms/1062458802-Ex%20Parte%20Motions%20to%20Compel%20Discovery.pdf
https://www.jud11.flcourts.org/documents/judges_forms/1062459053-JointCaseMgtReport121.pdf
Tab no.1..
Tab no.2..
Tab no.3..
Tab no.4..
https://www.jud11.flcourts.org/documents/judges_forms/1422459010-Order%20Granting%20Motion%20to%20Withdraw.docx
https://www.jud11.flcourts.org/documents/judges_forms/1422459046-ORDER%20ON%20Attorneys%20Fees.docx
Tab no.5..
https://www.jud11.flcourts.org/documents/judges_forms/1512459051-Evidence%20Procedures.docx
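The field names in the code above (`__EVENTTARGET`, `__ASYNCPOST`, the `X-MicrosoftAjax` header) suggest that each tab is loaded through an ASP.NET AJAX postback, which would explain why a plain GET only returns the first tab. The offline sketch below shows how such a POST payload can be assembled: it collects the named inputs from a page and then adds the tab-selection fields, using `json.dumps` instead of string concatenation. `SAMPLE_HTML` is a made-up stand-in for the real page, and `InputCollector` and `build_tab_payload` are hypothetical helpers; only the control IDs are taken from the code above. It uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages.

```python
import json
from html.parser import HTMLParser

# Control ID taken from the answer's code.
TAB_STRIP = "dnn$ctr1843$View$rtSectionHearingTypes"


class InputCollector(HTMLParser):
    """Collect name/value pairs from every <input> that has a name."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if "name" in attributes:
            # Inputs without a value attribute are stored as empty strings.
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_tab_payload(fields, index):
    """Return a copy of the form fields set up to select tab number `index`."""
    payload = dict(fields)
    payload["ScriptManager"] = "ScriptManager|" + TAB_STRIP
    payload["__EVENTTARGET"] = TAB_STRIP
    payload["__EVENTARGUMENT"] = json.dumps(
        {"type": 0, "index": str(index)}, separators=(",", ":")
    )
    payload[TAB_STRIP.replace("$", "_") + "_ClientState"] = json.dumps(
        {"selectedIndexes": [str(index)], "logEntries": [], "scrollState": {}},
        separators=(",", ":"),
    )
    payload["__ASYNCPOST"] = "true"
    payload["RadAJAXControlID"] = "dnn_ctr1843_View_RadAjaxManager1"
    return payload


# Made-up page fragment (an assumption, not the real site's markup).
SAMPLE_HTML = """
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789">
  <input type="text" name="search">
  <input type="submit">
</form>
"""

collector = InputCollector()
collector.feed(SAMPLE_HTML)
payload = build_tab_payload(collector.fields, 4)
print(payload["__EVENTARGUMENT"])  # prints {"type":0,"index":"4"}
```

Returning a fresh dict for each tab, rather than mutating one shared dict, keeps the fields scraped from the page separate from the per-tab values.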
Upvotes: 1