How to find all tab hidden href during web scrape?

Question

On the right-hand side of this website there are several tabs that contain documents to view.

The underlying code is an <a> tag with a partial href pointing to the document location. I've been trying to grab all of these documents (their hrefs usually start with '/documents/') but have had no success.

When I scrape, I only ever seem to grab the first set of documents, found in a tab with the 'Hearing Document' table. Below is the code I used to try to grab every href on this page.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.jud11.flcourts.org/Judge-Details?judgeid=1063&sectionid=2')
soup = BeautifulSoup(page.content, 'html.parser')

for link in soup.find_all("a"):
    if link.has_attr('href'):
        print(link['href'])

The output contains only the documents in the first tab (in this instance). Here is a snippet:

#collapse1
#collapse2
/documents/judges_forms/1062458802-Ex%20Parte%20Motions%20to%20Compel%20Discovery.pdf
/documents/judges_forms/1062459053-JointCaseMgtReport121.pdf
#collapse4
#collapse6

Does anyone know how to get the following links, which also exist on this same page? (I would say confirm this using your browser's Inspect Element feature, but it will not show them. You first have to open the tab that has the 'Hearing Documents' table and then use Inspect Element.)

/documents/judges_forms/1422459010-Order%20Granting%20Motion%20to%20Withdraw.docx

/documents/judges_forms/1422459046-ORDER%20ON%20Attorneys%20Fees.docx

Thanks for any help!

Andrej Kesely · Accepted Answer

You can use this example to get links to documents from other tabs:

import requests
from bs4 import BeautifulSoup


url = 'https://www.jud11.flcourts.org/Judge-Details?judgeid=1063&sectionid=2'
headers = {'X-MicrosoftAjax': 'Delta=true',
           'X-Requested-With': 'XMLHttpRequest'}

with requests.Session() as s:

    # Load the page once to collect the form fields (hidden inputs included)
    # that the server expects to receive back with every postback.
    soup = BeautifulSoup(s.get(url).content, 'html.parser')

    data = {}
    for i in soup.select('input[name]'):
        data[i['name']] = i.get('value', '')

    # Each tab is loaded by an AJAX postback; replay it for tab indexes 0-5.
    for page in range(0, 6):
        print('Tab no.{}..'.format(page))
        data['ScriptManager'] = "ScriptManager|dnn$ctr1843$View$rtSectionHearingTypes"
        data['__EVENTARGUMENT'] = '{"type":0,"index":"' + str(page) + '"}'
        data['__EVENTTARGET'] ="dnn$ctr1843$View$rtSectionHearingTypes"
        data['dnn_ctr1843_View_rtSectionHearingTypes_ClientState'] = '{"selectedIndexes":["' + str(page) + '"],"logEntries":[],"scrollState":{}}'
        data['__ASYNCPOST'] = "true"
        data['RadAJAXControlID'] = "dnn_ctr1843_View_RadAjaxManager1"

        # The response carries the selected tab's markup, including its links.
        soup = BeautifulSoup( s.post(url, headers=headers, data=data).content, 'html.parser' )
        for a in soup.select('a[href*="documents"]'):
            print('https://www.jud11.flcourts.org' + a['href'])

Prints:

Tab no.0..
https://www.jud11.flcourts.org/documents/judges_forms/1062458802-Ex%20Parte%20Motions%20to%20Compel%20Discovery.pdf
https://www.jud11.flcourts.org/documents/judges_forms/1062459053-JointCaseMgtReport121.pdf
Tab no.1..
Tab no.2..
Tab no.3..
Tab no.4..
https://www.jud11.flcourts.org/documents/judges_forms/1422459010-Order%20Granting%20Motion%20to%20Withdraw.docx
https://www.jud11.flcourts.org/documents/judges_forms/1422459046-ORDER%20ON%20Attorneys%20Fees.docx
Tab no.5..
https://www.jud11.flcourts.org/documents/judges_forms/1512459051-Evidence%20Procedures.docx
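
The fields set in the loop (`__EVENTTARGET`, `__EVENTARGUMENT`, `__ASYNCPOST`) together with the `X-MicrosoftAjax` header are those of an ASP.NET AJAX partial postback. That suggests each tab's content is fetched from the server when the tab is selected, which would explain why a single GET only returns the first tab's links. As a minimal sketch, the per-tab form data could be built by a small helper that uses `json.dumps` instead of string concatenation. The helper name `build_tab_payload` and the two constants are hypothetical; the control names are copied from the code above and are specific to this page.

```python
import json

# Control names copied from the answer's code; they are specific to that page.
TAB_STRIP = "dnn$ctr1843$View$rtSectionHearingTypes"
AJAX_MANAGER = "dnn_ctr1843_View_RadAjaxManager1"


def build_tab_payload(form_fields, tab_index):
    """Return a copy of form_fields with the postback fields for one tab set.

    Hypothetical helper for illustration; not part of requests or bs4.
    """
    index = str(tab_index)
    compact = (",", ":")  # no spaces, to match the JSON strings in the answer
    payload = dict(form_fields)  # copy so the caller's dict is left untouched
    payload["ScriptManager"] = "ScriptManager|" + TAB_STRIP
    payload["__EVENTTARGET"] = TAB_STRIP
    payload["__EVENTARGUMENT"] = json.dumps(
        {"type": 0, "index": index}, separators=compact
    )
    # The client-state field uses the control name with '$' replaced by '_'.
    payload[TAB_STRIP.replace("$", "_") + "_ClientState"] = json.dumps(
        {"selectedIndexes": [index], "logEntries": [], "scrollState": {}},
        separators=compact,
    )
    payload["__ASYNCPOST"] = "true"
    payload["RadAJAXControlID"] = AJAX_MANAGER
    return payload
```

Inside the loop, the request would then read `s.post(url, headers=headers, data=build_tab_payload(data, page))`. Because the helper copies the form fields, nothing carries over from one tab's payload to the next.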
