love2phish
love2phish

Reputation: 349

Web scraping an xml page in python?

I am confused as to how I would scrape all the links (that only contain the string "mp3") off a given xml page. The following code only returns empty brackets:

# Import required modules 
from lxml import html 
import requests 
  
# Request the page 
page = requests.get('https://feeds.megaphone.fm/darknetdiaries') 
  
# Parsing the page 
# (We need to use page.content rather than  
# page.text because html.fromstring implicitly 
# expects bytes as input.) 
tree = html.fromstring(page.content)   
  
# Get element using XPath 
buyers = tree.xpath('//enclosure[@url="mp3"]/text()') 
print(buyers)

Am I using @url wrong?

The links I am looking for:

enter image description here

Any help would be greatly appreciated!

Upvotes: 0

Views: 301

Answers (2)

carrvo
carrvo

Reputation: 638

It does not directly answer your question, but you could check out BeautifulSoup as an alternative (and it has an option to use lxml under the hoop too).

import lxml # early failure if not installed
from bs4 import BeautifulSoup
import requests 
  
# Request the page 
page = requests.get('https://feeds.megaphone.fm/darknetdiaries') 

# Parse
soup = BeautifulSoup(page.text, 'lxml')

# Find
#mp3 = [link['href'] for link in soup.find_all('a') if 'mp3' in link['href']]
# UPDATE - correct tag and attribute
mp3 = [link['url'] for link in soup.find_all('enclosure') if 'mp3' in link['url']]

Upvotes: 1

HedgeHog
HedgeHog

Reputation: 25048

What happens?

The following xpath wont work, as you mentioned it is the use of @url and also text()

//enclosure[@url="mp3"]/text()

Solution

The attribute url in any //enclosure should contain mp3 and then returned /@url

Change this line:

buyers = tree.xpath('//enclosure[@url="mp3"]/text()') 

to

buyers = tree.xpath('//enclosure[contains(@url,"mp3")]/@url') 

Output

['https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9231072845.mp3?updated=1610644901',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2643452814.mp3?updated=1609788944',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV5381316822.mp3?updated=1607279433',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9145504181.mp3?updated=1607280708',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV4345070838.mp3?updated=1606110384',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV8112097820.mp3?updated=1604866665',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2164178070.mp3?updated=1603781321',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV1107638673.mp3?updated=1610220449',
...]

Upvotes: 2

Related Questions