Reputation: 8764
I'm trying to parse this feed: https://feeds.podcastmirror.com/dudesanddadspodcast
The channel section has two entries for podcast:person:
<podcast:person role="host" img="https://dudesanddadspodcast.com/files/2019/03/andy.jpg" href="https://www.podchaser.com/creators/andy-lehman-107aRuVQLA">Andy Lehman</podcast:person>
<podcast:person role="host" img="https://dudesanddadspodcast.com/files/2019/03/joel.jpg" href="https://www.podchaser.com/creators/joel-demott-107aRuVQLH" >Joel DeMott</podcast:person>
When parsed, feedparser only brings in one name
>>> import feedparser
>>> d = feedparser.parse('https://feeds.podcastmirror.com/dudesanddadspodcast')
>>> d.feed['podcast_person']
{'role': 'host', 'img': 'https://dudesanddadspodcast.com/files/2019/03/joel.jpg', 'href': 'https://www.podchaser.com/creators/joel-demott-107aRuVQLH'}
What would I change so that podcast_person is instead a list, so I could loop through each one?
Upvotes: 2
Views: 315
Reputation: 571
Since I'm familiar with lxml, and considering that someone has already posted a solution using feedparser, I wanted to test how lxml could be used to parse an RSS feed. In my opinion, the daunting part is handling the RSS namespaces, but once that is resolved the task becomes quite easy:
import urllib.request
from lxml import etree

feed = etree.parse(urllib.request.urlopen('https://feeds.podcastmirror.com/dudesanddadspodcast')).getroot()
namespaces = {
    'itunes': 'http://www.itunes.com/dtds/podcast-1.0.dtd'
}
for episode in feed.iter('item'):
    # print(etree.tostring(episode))
    authors = episode.xpath('itunes:author/text()', namespaces=namespaces)
    print(authors)
    #title = episode.xpath('itunes:title/text()', namespaces=namespaces)
    #episode_metadata = '{} - {}'.format(title[0] if title else 'Missing title', authors[0] if authors else 'Missing authors')
    #print(episode_metadata)
The code above runs close to 3x faster than a similar solution with feedparser, reflecting the performance gains from using lxml as the parsing library.
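The same namespace-mapping approach extends to the channel-level podcast:person elements from the question. Below is a minimal sketch, not a definitive implementation. It swaps in the standard library's xml.etree.ElementTree for lxml so that it runs without extra dependencies, and it parses a small inline sample instead of the live feed. The namespace URI is an assumption (check the xmlns:podcast attribute on the feed's rss element), and channel_people and SAMPLE_FEED are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "podcast" prefix; verify it against the
# xmlns:podcast declaration in the actual feed.
NAMESPACES = {"podcast": "https://podcastindex.org/namespace/1.0"}

# Inline sample standing in for the downloaded feed (illustration only).
SAMPLE_FEED = """
<rss version="2.0" xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <channel>
    <podcast:person role="host" href="https://www.podchaser.com/creators/andy-lehman-107aRuVQLA">Andy Lehman</podcast:person>
    <podcast:person role="host" href="https://www.podchaser.com/creators/joel-demott-107aRuVQLH">Joel DeMott</podcast:person>
  </channel>
</rss>
"""


def channel_people(xml_text):
    """Return one dict per channel-level podcast:person element."""
    root = ET.fromstring(xml_text.strip())
    channel = root.find("channel")
    people = []
    # findall only looks at direct children, so episode-level persons
    # inside <item> elements are not included.
    for person in channel.findall("podcast:person", NAMESPACES):
        entry = dict(person.attrib)  # role, href, img, ...
        entry["name"] = (person.text or "").strip()
        people.append(entry)
    return people


people = channel_people(SAMPLE_FEED)
for person in people:
    print(person["role"], person["name"])
```

With lxml, the equivalent lookup would presumably be an xpath call with the same namespaces mapping, as in the itunes:author example above.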
Upvotes: 0
Reputation: 5053
Instead of feedparser, I would prefer BeautifulSoup.
You can copy the code below to test the results.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://feeds.podcastmirror.com/dudesanddadspodcast").content
soup = BeautifulSoup(r, 'html.parser')
feeds = soup.find_all("podcast:person")
print(type(feeds)) # <class 'bs4.element.ResultSet'>, a subclass of list
# You can loop the `feeds` variable.
Upvotes: 1
Reputation: 507
You can iterate over feed['items'] and get all the records.
import feedparser
feed = feedparser.parse('https://feeds.podcastmirror.com/dudesanddadspodcast')
if feed:
    for item in feed['items']:
        print(f'{item["title"]} - {item["author"]}')
Upvotes: 1
Reputation: 455
Idea #1:
from bs4 import BeautifulSoup
import requests
r = requests.get("https://feeds.podcastmirror.com/dudesanddadspodcast").content
soup = BeautifulSoup(r, 'html.parser')
soup.find_all("podcast:person")
Output:
[<podcast:person href="https://www.podchaser.com/creators/andy-lehman-107aRuVQLA" img="https://dudesanddadspodcast.com/files/2019/03/andy.jpg" role="host">Andy Lehman</podcast:person>,
<podcast:person href="https://www.podchaser.com/creators/joel-demott-107aRuVQLH" img="https://dudesanddadspodcast.com/files/2019/03/joel.jpg" role="host">Joel DeMott</podcast:person>,
<podcast:person href="https://www.podchaser.com/creators/cory-martin-107aRwmCuu" img="" role="guest">Cory Martin</podcast:person>,
<podcast:person href="https://www.podchaser.com/creators/julie-lehman-107aRuVQPL" img="" role="guest">Julie Lehman</podcast:person>]
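To loop through these results, each tag's text and attributes can be copied into a plain dict. This is a minimal sketch that uses a small inline sample in place of the live feed; SAMPLE, people, and hosts are illustrative names, not part of any library.

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the downloaded feed (illustration only).
SAMPLE = """
<channel>
<podcast:person role="host" href="https://www.podchaser.com/creators/andy-lehman-107aRuVQLA">Andy Lehman</podcast:person>
<podcast:person role="guest" img="" href="https://www.podchaser.com/creators/cory-martin-107aRwmCuu">Cory Martin</podcast:person>
</channel>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

people = []
for tag in soup.find_all("podcast:person"):
    people.append({
        "name": tag.get_text(strip=True),
        "role": tag.get("role"),  # None if the attribute is missing
        "href": tag.get("href"),
    })

# Keep only the people whose role attribute is "host".
hosts = [p["name"] for p in people if p["role"] == "host"]
print(hosts)
```

Note that find_all searches the whole document, so on the real feed it returns episode-level guests as well as the channel-level hosts, as the output above shows. Filtering on the role attribute is one way to separate them.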
Idea #2:
import feedparser
d = feedparser.parse('https://feeds.podcastmirror.com/dudesanddadspodcast')
hosts = d.entries[1]['authors'][1]['name'].split(", ")
print("The hosts of this Podcast are {} and {}.".format(hosts[0], hosts[1]))
Output:
The hosts of this Podcast are Joel DeMott and Andy Lehman.
Upvotes: 1