Brenden

Reputation: 8764

Accessing duplicate feed tags using feedparser

I'm trying to parse this feed: https://feeds.podcastmirror.com/dudesanddadspodcast

The channel section has two entries for podcast:person

<podcast:person role="host" img="https://dudesanddadspodcast.com/files/2019/03/andy.jpg" href="https://www.podchaser.com/creators/andy-lehman-107aRuVQLA">Andy Lehman</podcast:person>
<podcast:person role="host" img="https://dudesanddadspodcast.com/files/2019/03/joel.jpg" href="https://www.podchaser.com/creators/joel-demott-107aRuVQLH" >Joel DeMott</podcast:person>

When parsed, feedparser only brings in one entry:

>>> import feedparser
>>> d = feedparser.parse('https://feeds.podcastmirror.com/dudesanddadspodcast')
>>> d.feed['podcast_person']
{'role': 'host', 'img': 'https://dudesanddadspodcast.com/files/2019/03/joel.jpg', 'href': 'https://www.podchaser.com/creators/joel-demott-107aRuVQLH'}

What would I change so that podcast_person is a list I can loop through?

Upvotes: 2

Views: 315

Answers (4)

micmalti

Reputation: 571

Since I'm familiar with lxml, and since someone has already posted a solution using feedparser, I wanted to test how lxml could be used to parse an RSS feed. In my opinion, the daunting part is handling the RSS namespaces, but once that is resolved the task becomes quite easy:

import urllib.request
from lxml import etree

feed = etree.parse(urllib.request.urlopen('https://feeds.podcastmirror.com/dudesanddadspodcast')).getroot()
namespaces = {
    'itunes': 'http://www.itunes.com/dtds/podcast-1.0.dtd'
}

for episode in feed.iter('item'):
    # print(etree.tostring(episode))
    authors = episode.xpath('itunes:author/text()', namespaces=namespaces)
    print(authors)
    #title = episode.xpath('itunes:title/text()', namespaces=namespaces)
    #episode_metadata = '{} - {}'.format(title[0] if title else 'Missing title', authors[0] if authors else 'Missing authors')
    #print(episode_metadata)

The code above runs close to 3x faster than a similar solution with feedparser, which reflects the performance gain from using lxml as the parsing library.
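The same namespace-aware lookup can be pointed at the podcast:person tags from the question. Below is a minimal sketch that uses the standard library's xml.etree.ElementTree (lxml offers a compatible findall) on a trimmed inline sample instead of the live feed. The namespace URI, the sample XML, and the channel_people helper are assumptions for illustration; check the xmlns:podcast declaration in the real feed.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the podcast: prefix; check the xmlns:podcast
# declaration at the top of the real feed and adjust if it differs.
NAMESPACES = {"podcast": "https://podcastindex.org/namespace/1.0"}

# Trimmed-down stand-in for the real feed so the example runs offline.
SAMPLE_FEED = """<rss version="2.0" xmlns:podcast="https://podcastindex.org/namespace/1.0">
  <channel>
    <podcast:person role="host" href="https://www.podchaser.com/creators/andy-lehman-107aRuVQLA">Andy Lehman</podcast:person>
    <podcast:person role="host" href="https://www.podchaser.com/creators/joel-demott-107aRuVQLH">Joel DeMott</podcast:person>
  </channel>
</rss>
"""


def channel_people(xml_text):
    """Return one dict per channel-level podcast:person element."""
    root = ET.fromstring(xml_text)
    people = []
    for person in root.findall("channel/podcast:person", NAMESPACES):
        entry = dict(person.attrib)  # role, img, href, ...
        entry["name"] = (person.text or "").strip()
        people.append(entry)
    return people


people = channel_people(SAMPLE_FEED)
for person in people:
    print(person["name"], "-", person.get("role"))
```

Each element becomes its own dict, so duplicate tags are kept rather than overwritten.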

Upvotes: 0

Mayur

Reputation: 5053

Instead of feedparser, I would use BeautifulSoup.

You can copy the code below to test the result.

from bs4 import BeautifulSoup
import requests

r = requests.get("https://feeds.podcastmirror.com/dudesanddadspodcast").content
soup = BeautifulSoup(r, 'html.parser')

feeds = soup.find_all("podcast:person")

print(type(feeds))  # <class 'bs4.element.ResultSet'>, a list subclass

# You can loop the `feeds` variable.

Upvotes: 1

Max Khrichtchatyi

Reputation: 507

You can iterate over feed['items'] to get all the records.

import feedparser

feed = feedparser.parse('https://feeds.podcastmirror.com/dudesanddadspodcast')

if feed:
    for item in feed['items']:
        print(f'{item["title"]} - {item["author"]}')
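One caveat: item["author"] raises a KeyError for any entry that has no author. A minimal sketch of a more tolerant loop body follows, assuming the entries support dict-style .get (feedparser entries are dict subclasses). The describe_item helper and the sample data are hypothetical, with plain dicts standing in for parsed entries.

```python
def describe_item(item):
    """Format one entry as 'title - author', tolerating missing keys."""
    title = item.get("title", "Missing title")
    author = item.get("author", "Missing author")
    return f"{title} - {author}"


# Plain dicts standing in for feedparser entries (hypothetical sample data).
sample_items = [
    {"title": "Episode 1", "author": "Joel DeMott, Andy Lehman"},
    {"title": "Episode 2"},  # no author key
]

lines = [describe_item(item) for item in sample_items]
for line in lines:
    print(line)
```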

Upvotes: 1

thenarfer

Reputation: 455

Idea #1:

from bs4 import BeautifulSoup
import requests

r = requests.get("https://feeds.podcastmirror.com/dudesanddadspodcast").content
soup = BeautifulSoup(r, 'html.parser')

soup.find_all("podcast:person")

Output:

[<podcast:person href="https://www.podchaser.com/creators/andy-lehman-107aRuVQLA" img="https://dudesanddadspodcast.com/files/2019/03/andy.jpg" role="host">Andy Lehman</podcast:person>,
 <podcast:person href="https://www.podchaser.com/creators/joel-demott-107aRuVQLH" img="https://dudesanddadspodcast.com/files/2019/03/joel.jpg" role="host">Joel DeMott</podcast:person>,
 <podcast:person href="https://www.podchaser.com/creators/cory-martin-107aRwmCuu" img="" role="guest">Cory Martin</podcast:person>,
 <podcast:person href="https://www.podchaser.com/creators/julie-lehman-107aRuVQPL" img="" role="guest">Julie Lehman</podcast:person>]

Idea #2:

import feedparser

d = feedparser.parse('https://feeds.podcastmirror.com/dudesanddadspodcast')
hosts = d.entries[1]['authors'][1]['name'].split(", ")

print("The hosts of this Podcast are {} and {}.".format(hosts[0], hosts[1]))

Output:

The hosts of this Podcast are Joel DeMott and Andy Lehman.
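Idea #2 hard-codes the entry index and assumes exactly two names. A minimal sketch of the formatting step that copes with any number of names is shown here, assuming the author string is comma-separated as in the output above. The describe_hosts helper is hypothetical, and the string passed in stands in for d.entries[1]['authors'][1]['name'].

```python
def describe_hosts(author_string):
    """Build the hosts sentence from a comma-separated author string."""
    names = [name.strip() for name in author_string.split(",") if name.strip()]
    if not names:
        return "No hosts are listed for this Podcast."
    if len(names) == 1:
        return f"The host of this Podcast is {names[0]}."
    # Join all but the last name with commas, then add the last with "and".
    return "The hosts of this Podcast are {} and {}.".format(
        ", ".join(names[:-1]), names[-1]
    )


sentence = describe_hosts("Joel DeMott, Andy Lehman")
print(sentence)
```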

Upvotes: 1
