Reputation: 99
I'm trying to extract links from this page: http://www.tadpoletunes.com/tunes/celtic1/ (view-source:http://www.tadpoletunes.com/tunes/celtic1/), but I only want the reels, which in the page are delimited as follows. Start:
<th align="left"><b><a name="reels">REELS</a></b></th>
End (the lines above the following):
<th align="left"><b><a name="slides">SLIDES</a></b></th>
The question is how to do this. I have the following code which gets the links for everything with a .mid extension:
import urllib.request
import bs4 as bs

def import_midifiles():
    archive_url = "http://www.tadpoletunes.com/tunes/celtic1/"
    sauce = urllib.request.urlopen("http://www.tadpoletunes.com/tunes/celtic1/celtic.htm").read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    listoflists = []  # one list of .mid URLs per table
    tables = soup.find_all('table')
    for table in tables:
        listofmidis = []
        for link in table.find_all('a', href=True):
            if link['href'].endswith('.mid'):
                listofmidis.append(archive_url + link['href'])
        if listofmidis:
            listoflists.append(listofmidis)
    # Flatten the per-table lists into a single list
    midi_list = [item for sublist in listoflists for item in sublist]
    return midi_list
I cannot figure this out from the BeautifulSoup docs. I need the code because I will be repeating this on other sites to scrape data for training a model.
Upvotes: 1
Views: 84
Reputation: 7248
To get all the "REELS" links, you need to do the following:
Get the links in between "REELS" and "SLIDES" as you mentioned. To do that, first you'll need to find the <tr> tag containing <a name="reels">REELS</a>. This can be done using the .find_parent() method.
reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')
Now, you can use the .find_next_siblings() method to get all the <tr> tags after "REELS". We can break the loop when we find the <tr> tag with <a name="slides">SLIDES</a> (or .find('a').text == 'SLIDES').
Complete code:
import requests
from bs4 import BeautifulSoup

def import_midifiles():
    BASE_URL = 'http://www.tadpoletunes.com/tunes/celtic1/'
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'lxml')
    midi_list = []
    # Header row that contains <a name="reels">REELS</a>
    reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')
    for tr in reels_tr.find_next_siblings('tr'):
        link = tr.find('a')
        if link is None:  # skip rows without any link
            continue
        if link.text == 'SLIDES':  # next section header: stop
            break
        midi_list.append(BASE_URL + link['href'])
    return midi_list

print(import_midifiles())
Partial output:
['http://www.tadpoletunes.com/tunes/celtic1/ashplant.mid', 'http://www.tadpoletunes.com/tunes/celtic1/bashful.mid', 'http://www.tadpoletunes.com/tunes/celtic1/bigpat.mid', 'http://www.tadpoletunes.com/tunes/celtic1/birdcage.mid', 'http://www.tadpoletunes.com/tunes/celtic1/boatstre.mid',
...
...
'http://www.tadpoletunes.com/tunes/celtic1/silspear.mid', 'http://www.tadpoletunes.com/tunes/celtic1/stafreel.mid', 'http://www.tadpoletunes.com/tunes/celtic1/kilkenny.mid', 'http://www.tadpoletunes.com/tunes/celtic1/swaltail.mid', 'http://www.tadpoletunes.com/tunes/celtic1/cuptea.mid']
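Since you mentioned repeating this on other sites, here is a rough sketch of how the same idea could be generalized. It is not tested against the real page, only against a small made-up HTML sample. The function name `links_between`, its parameters, the `SAMPLE` markup and the `example.com` base URL are all my own assumptions for illustration. It uses `.find_all_next()` instead of `.find_next_siblings()`, so it should not depend on the section markers being in sibling <tr> rows, and `urljoin` instead of string concatenation so absolute hrefs are handled too.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def links_between(html, base_url, start_name, end_name, suffix='.mid'):
    """Collect links ending in `suffix` that sit between two named anchors.

    Hypothetical helper: `start_name` and `end_name` are the values of the
    `name` attribute on the <a> tags that mark the section boundaries.
    """
    soup = BeautifulSoup(html, 'html.parser')  # stdlib parser, no lxml needed
    start = soup.find('a', attrs={'name': start_name})
    if start is None:
        return []

    links = []
    # find_all_next() walks the rest of the document in order,
    # regardless of how the tags are nested.
    for a in start.find_all_next('a'):
        if a.get('name') == end_name:
            break  # reached the next section marker
        href = a.get('href', '')
        if href.lower().endswith(suffix):
            links.append(urljoin(base_url, href))
    return links


# Made-up markup that imitates the structure described in the question.
SAMPLE = """
<table>
<tr><th><b><a name="jigs">JIGS</a></b></th></tr>
<tr><td><a href="jig1.mid">Jig One</a></td></tr>
<tr><th><b><a name="reels">REELS</a></b></th></tr>
<tr><td><a href="reel1.mid">Reel One</a></td></tr>
<tr><td><a href="reel2.mid">Reel Two</a></td></tr>
<tr><th><b><a name="slides">SLIDES</a></b></th></tr>
<tr><td><a href="slide1.mid">Slide One</a></td></tr>
</table>
"""

reels = links_between(SAMPLE, 'http://example.com/tunes/', 'reels', 'slides')
print(reels)
```

For the page in the question you would pass `r.text`, the base URL, `'reels'` and `'slides'`. On other sites the boundary markers will probably look different, so the two `find`/`get('name')` checks are the parts you would most likely need to adapt.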
Upvotes: 1