Extract links after th in beautifulsoup

Question

I'm trying to extract links from this page: http://www.tadpoletunes.com/tunes/celtic1/ (source: view-source:http://www.tadpoletunes.com/tunes/celtic1/) but I only want the reels. In the page source, the reels section starts at this heading:

REELS

and ends on the line just above this heading:

SLIDES

The question is how to do this. I have the following code which gets the links for everything with a .mid extension:

import urllib.request
import bs4 as bs

def import_midifiles():
    archive_url = "http://www.tadpoletunes.com/tunes/celtic1/"
    sauce = urllib.request.urlopen("http://www.tadpoletunes.com/tunes/celtic1/celtic.htm").read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    listoflists = []  # one list of .mid URLs per table
    tables = soup.find_all('table')
    for table in tables:
        listofmidis = []  # reset for each table
        for link in table.find_all('a', href=True):
            if link['href'].endswith('.mid'):
                listofmidis.append(archive_url + link['href'])
        if listofmidis:
            listoflists.append(listofmidis)
    # Flatten the per-table lists into a single list of URLs
    midi_list = [item for sublist in listoflists for item in sublist]
    return midi_list

I cannot figure this out from the BeautifulSoup docs. I need the code because I will be repeating this on other sites to scrape data for training a model.

Keyur Potdar · Accepted Answer

To get all the "REELS" links, you need to do the following:

Get the links between "REELS" and "SLIDES", as you mentioned. To do that, first find the <tr> tag containing REELS: locate the <a name="reels"> anchor, then climb to its enclosing row using the .find_parent() method.

reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')

Now, you can use the .find_next_siblings() method to get all the <tr> tags after "REELS". Break out of the loop on reaching the row whose link text is SLIDES (that is, when tr.find('a').text == 'SLIDES').
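The two steps above can be tried offline on a small table. This is only a sketch: the HTML below is a made-up stand-in that assumes one heading row per section followed by one row per tune, and it is not copied from the site. The tune titles and the jig/slide file names are invented, and 'html.parser' is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's table; not the site's real markup.
SAMPLE_HTML = """
<table>
  <tr><td><a name="jigs">JIGS</a></td></tr>
  <tr><td><a href="jig1.mid">A Jig</a></td></tr>
  <tr><td><a name="reels">REELS</a></td></tr>
  <tr><td><a href="ashplant.mid">Ashplant</a></td></tr>
  <tr><td><a href="bashful.mid">Bashful Bachelor</a></td></tr>
  <tr><td><a name="slides">SLIDES</a></td></tr>
  <tr><td><a href="slide1.mid">A Slide</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 1: locate the named anchor, then climb to the row that contains it.
reels_tr = soup.find("a", {"name": "reels"}).find_parent("tr")

# Step 2: walk the rows that follow, stopping at the SLIDES heading row.
reel_hrefs = []
for tr in reels_tr.find_next_siblings("tr"):
    link = tr.find("a")
    if link.text == "SLIDES":
        break
    reel_hrefs.append(link["href"])

print(reel_hrefs)  # ['ashplant.mid', 'bashful.mid']
```

Rows before REELS (the jigs) are never visited because the walk starts at the REELS row, and rows after SLIDES are skipped by the break.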

Complete code:

import requests
from bs4 import BeautifulSoup

def import_midifiles():
    BASE_URL = 'http://www.tadpoletunes.com/tunes/celtic1/'
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'lxml')
    midi_list = []
    # Row holding the <a name="reels"> anchor that opens the REELS section
    reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')
    # Walk the rows that follow until the SLIDES heading row
    for tr in reels_tr.find_next_siblings('tr'):
        if tr.find('a').text == 'SLIDES':
            break
        midi_list.append(BASE_URL + tr.find('a')['href'])
    return midi_list

print(import_midifiles())

Partial output:

['http://www.tadpoletunes.com/tunes/celtic1/ashplant.mid', 'http://www.tadpoletunes.com/tunes/celtic1/bashful.mid', 'http://www.tadpoletunes.com/tunes/celtic1/bigpat.mid', 'http://www.tadpoletunes.com/tunes/celtic1/birdcage.mid', 'http://www.tadpoletunes.com/tunes/celtic1/boatstre.mid',
...
...
'http://www.tadpoletunes.com/tunes/celtic1/silspear.mid', 'http://www.tadpoletunes.com/tunes/celtic1/stafreel.mid', 'http://www.tadpoletunes.com/tunes/celtic1/kilkenny.mid', 'http://www.tadpoletunes.com/tunes/celtic1/swaltail.mid', 'http://www.tadpoletunes.com/tunes/celtic1/cuptea.mid']
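Since the same job will be repeated on other sites, the approach might be wrapped in a helper that takes the HTML and the two section anchors as arguments. The following is a sketch under assumptions, not a tested scraper for any particular site: links_between and its parameters are hypothetical names, it assumes the end section is marked by a named anchor such as <a name="slides"> (the code above only confirms the name "reels" and the text SLIDES), and the sample HTML is invented.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def links_between(html, base_url, start_name, end_name, suffix=".mid"):
    """Return absolute URLs of links in the table rows between two named anchors.

    Hypothetical helper: assumes each section starts with a row containing
    <a name="...">, as in the REELS example.
    """
    soup = BeautifulSoup(html, "html.parser")
    start = soup.find("a", {"name": start_name})
    if start is None:
        return []
    start_row = start.find_parent("tr")
    if start_row is None:
        return []
    urls = []
    for tr in start_row.find_next_siblings("tr"):
        # Stop at the row that holds the end anchor.
        if tr.find("a", {"name": end_name}) is not None:
            break
        # A row may hold zero or several links; keep only the wanted suffix.
        for a in tr.find_all("a", href=True):
            if a["href"].endswith(suffix):
                urls.append(urljoin(base_url, a["href"]))
    return urls


# Invented sample markup, used only to exercise the helper offline.
SAMPLE_HTML = """
<table>
  <tr><td><a name="reels">REELS</a></td></tr>
  <tr><td>no link in this row</td></tr>
  <tr><td><a href="ashplant.mid">Ashplant</a> <a href="notes.txt">notes</a></td></tr>
  <tr><td><a name="slides">SLIDES</a></td></tr>
  <tr><td><a href="slide1.mid">A Slide</a></td></tr>
</table>
"""

reels = links_between(
    SAMPLE_HTML, "http://www.tadpoletunes.com/tunes/celtic1/", "reels", "slides"
)
print(reels)
```

Two design choices differ from the code above: urljoin builds the absolute URL instead of string concatenation, so relative paths such as ../x.mid would also resolve, and rows without an <a> tag are skipped rather than raising an error.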
