steveeweeveewoo

Reputation: 99

Extract links after a <th> in BeautifulSoup

I'm trying to extract links from this page: http://www.tadpoletunes.com/tunes/celtic1/ (view-source:http://www.tadpoletunes.com/tunes/celtic1/), but I only want the reels. In the page source, the reels section starts at:

<th align="left"><b><a name="reels">REELS</a></b></th>

and ends just above the following line:

<th align="left"><b><a name="slides">SLIDES</a></b></th>

The question is how to do this. I have the following code which gets the links for everything with a .mid extension:

import urllib.request
import bs4 as bs

def import_midifiles():
    archive_url = "http://www.tadpoletunes.com/tunes/celtic1/"
    sauce = urllib.request.urlopen("http://www.tadpoletunes.com/tunes/celtic1/celtic.htm").read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    midi_list = []
    # Collect every .mid link from every table on the page.
    for table in soup.find_all('table'):
        for link in table.find_all('a', href=True):
            if link['href'].endswith('.mid'):
                midi_list.append(archive_url + link['href'])
    return midi_list

I cannot figure this out from the BeautifulSoup docs. I need the code because I will be repeating this on other sites to scrape data for training a model.

Upvotes: 1

Views: 84

Answers (1)

Keyur Potdar

Reputation: 7248

To get all the "REELS" links, you need to do the following:

Get the links in between "REELS" and "SLIDES" as you mentioned. To do that, first you'll need to find the <tr> tag containing <a name="reels">REELS</a>. This can be done using the .find_parent() method.

reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')

Now you can use the .find_next_siblings() method to get all the <tr> tags after "REELS". Break out of the loop on reaching the <tr> tag that contains <a name="slides">SLIDES</a>, which can be detected with tr.find('a').text == 'SLIDES'.
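
As a small offline illustration of these two calls, here is a sketch run against a hypothetical, simplified table. The markup and file names are invented for the example and are not taken from the real page. It uses the built-in html.parser so it does not depend on lxml, and it detects the end of the section by the anchor's name attribute instead of its text.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the page's table layout.
HTML = """
<table>
  <tr><th><b><a name="jigs">JIGS</a></b></th></tr>
  <tr><td><a href="jig1.mid">Jig One</a></td></tr>
  <tr><th><b><a name="reels">REELS</a></b></th></tr>
  <tr><td><a href="reel1.mid">Reel One</a></td></tr>
  <tr><td><a href="reel2.mid">Reel Two</a></td></tr>
  <tr><th><b><a name="slides">SLIDES</a></b></th></tr>
  <tr><td><a href="slide1.mid">Slide One</a></td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Climb from the named anchor to the row that contains it.
reels_tr = soup.find("a", {"name": "reels"}).find_parent("tr")

hrefs = []
# Walk the rows that follow the REELS header, in document order.
for tr in reels_tr.find_next_siblings("tr"):
    link = tr.find("a")
    if link is None:
        continue  # row without any anchor
    if link.get("name") == "slides":
        break  # reached the next section header
    hrefs.append(link["href"])

print(hrefs)  # ['reel1.mid', 'reel2.mid']
```

Rows before the REELS header (the jig) are never visited, and the loop stops before the slide, so only the two reels are collected.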

Complete code:

import requests
from bs4 import BeautifulSoup

def import_midifiles():
    BASE_URL = 'http://www.tadpoletunes.com/tunes/celtic1/'
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'lxml')
    midi_list = []
    # Start from the header row that holds <a name="reels">.
    reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')
    for tr in reels_tr.find_next_siblings('tr'):
        link = tr.find('a')
        if link is None:
            continue  # skip rows without a link
        if link.text == 'SLIDES':
            break  # next section reached
        midi_list.append(BASE_URL + link['href'])
    return midi_list

print(import_midifiles())

Partial output:

['http://www.tadpoletunes.com/tunes/celtic1/ashplant.mid', 'http://www.tadpoletunes.com/tunes/celtic1/bashful.mid', 'http://www.tadpoletunes.com/tunes/celtic1/bigpat.mid', 'http://www.tadpoletunes.com/tunes/celtic1/birdcage.mid', 'http://www.tadpoletunes.com/tunes/celtic1/boatstre.mid',
...
...
'http://www.tadpoletunes.com/tunes/celtic1/silspear.mid', 'http://www.tadpoletunes.com/tunes/celtic1/stafreel.mid', 'http://www.tadpoletunes.com/tunes/celtic1/kilkenny.mid', 'http://www.tadpoletunes.com/tunes/celtic1/swaltail.mid', 'http://www.tadpoletunes.com/tunes/celtic1/cuptea.mid']
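
Since you mentioned repeating this on other sites, the same idea can be wrapped in a helper. This is only a sketch under assumptions: links_between, its parameters, the example.com base URL and the sample markup are all hypothetical, and it assumes the section headers are named anchors inside <tr> rows of the same table, as on this page. It uses urljoin rather than string concatenation so that relative and absolute hrefs are both handled.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def links_between(html, base_url, start_name, end_name, suffix=".mid"):
    """Return absolute URLs of links in the table rows between two named anchors."""
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find("a", {"name": start_name})
    start_row = start.find_parent("tr") if start is not None else None
    if start_row is None:
        return []  # start anchor missing, or not inside a table row
    urls = []
    for tr in start_row.find_next_siblings("tr"):
        # Stop at the row holding the end anchor, wherever it sits in the row.
        if tr.find("a", {"name": end_name}) is not None:
            break
        for link in tr.find_all("a", href=True):
            if link["href"].endswith(suffix):
                urls.append(urljoin(base_url, link["href"]))
    return urls

# Hypothetical sample markup, not the real page.
sample = """
<table>
  <tr><th><a name="reels">REELS</a></th></tr>
  <tr><td><a href="reel1.mid">Reel One</a></td><td><a href="reel1.txt">notes</a></td></tr>
  <tr><td>no link in this row</td></tr>
  <tr><td><a href="reel2.mid">Reel Two</a></td></tr>
  <tr><th><a name="slides">SLIDES</a></th></tr>
  <tr><td><a href="slide1.mid">Slide One</a></td></tr>
</table>
"""

result = links_between(sample, "http://example.com/tunes/", "reels", "slides")
print(result)
```

For a live page you would pass requests.get(url).text as html and the page URL as base_url. Rows without links and links with other extensions are skipped instead of raising an error.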

Upvotes: 1
