taga

Reputation: 3885

How to extract RSS links from website with Python

I am trying to extract all RSS feed links from some websites (if RSS exists at all). Below are some website links that have RSS, followed by a list of RSS links from those websites.

website_links = ["https://www.diepresse.com/", 
"https://www.sueddeutsche.de/", 
"https://www.berliner-zeitung.de/", 
"https://www.aargauerzeitung.ch/", 
"https://www.luzernerzeitung.ch/", 
"https://www.nzz.ch/",
"https://www.spiegel.de/", 
"https://www.blick.ch/",
"https://www.berliner-zeitung.de/", 
"https://www.ostsee-zeitung.de/", 
"https://www.kleinezeitung.at/", 
"https://www.blick.ch/", 
"https://www.ksta.de/", 
"https://www.tagblatt.ch/", 
"https://www.srf.ch/", 
"https://www.derstandard.at/"]


website_rss_links = ["https://www.diepresse.com/rss/Kunst", 
"https://rss.sueddeutsche.de/rss/Kultur", 
"https://www.berliner-zeitung.de/feed.id_kultur-kunst.xml", 
"https://www.aargauerzeitung.ch/leben-kultur.rss", 
"https://www.luzernerzeitung.ch/kultur.rss", 
"https://www.nzz.ch/technologie.rss", 
"https://www.spiegel.de/kultur/literatur/index.rss", 
"https://www.luzernerzeitung.ch/wirtschaft.rss", 
"https://www.blick.ch/wirtschaft/rss.xml", 
"https://www.berliner-zeitung.de/feed.id_abgeordnetenhauswahl.xml", 
"https://www.ostsee-zeitung.de/arc/outboundfeeds/rss/category/wissen/", 
"https://www.kleinezeitung.at/rss/politik", 
"https://www.blick.ch/wirtschaft/rss.xml", 
"https://feed.ksta.de/feed/rss/politik/index.rss", 
"https://www.tagblatt.ch/wirtschaft.rss", 
"https://www.srf.ch/news/bnf/rss/1926", 
"https://www.derstandard.at/rss/wirtschaft"]

My approach is to extract all links and then check whether some of them have "rss" in them, but that is just a first step:

import requests
from bs4 import BeautifulSoup

for url in website_links:
    response = requests.get(url)
    print(response)
    soup = BeautifulSoup(response.content, 'html.parser')
    list_of_links = [link["href"] for link in soup.select("a[href]")]
    print("Number of links:", len(list_of_links))

    for link in list_of_links:
        if "rss" in link:
            print(url)
            print(link)
    print()
    

I have heard that I can look for RSS links like this, but I do not know how to incorporate it into my code:

type=application/rss+xml
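I imagine the check could look something like this (an untested sketch: the HTML string and `example.com` URL are placeholders standing in for a fetched page):

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Placeholder HTML standing in for response.content from a real site
html = """<html><head>
<link rel="alternate" type="application/rss+xml" title="Feed" href="/index.rss">
</head><body></body></html>"""

page_url = "https://www.example.com/"  # placeholder for the site being crawled
soup = BeautifulSoup(html, "html.parser")
# <link type="application/rss+xml"> is the feed autodiscovery convention;
# urljoin resolves relative hrefs against the page URL
feed_links = [urljoin(page_url, tag["href"])
              for tag in soup.select('link[type="application/rss+xml"]')
              if tag.has_attr("href")]
print(feed_links)
```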

My goal is to get working RSS URLs in the end. Maybe the issue is that I am only sending a request to the front page, and perhaps I should crawl other pages in order to extract all RSS links, but I hope there is a faster/better way to do the extraction.

You can see that the RSS links contain or end with (for example):

.rss
/rss
/rss/
rss.xml
/feed/
rss-feed

etc.
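A rough pattern for these endings could be sketched like this (a heuristic only, not exhaustive, checked against a few of the URLs above):

```python
import re

# Heuristic: matches the common feed-URL endings listed above
FEED_PATTERN = re.compile(r"(\.rss$|/rss/?$|rss\.xml$|/feed/?$|rss-feed)", re.IGNORECASE)

candidates = [
    "https://www.luzernerzeitung.ch/kultur.rss",
    "https://www.kleinezeitung.at/rss/politik",  # 'rss' in path, but no listed ending
    "https://www.blick.ch/wirtschaft/rss.xml",
]
matches = [u for u in candidates if FEED_PATTERN.search(u)]
print(matches)
```

Note that it misses feeds like `/rss/politik` where "rss" sits mid-path, which is why matching on endings alone is only a first filter.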

Upvotes: 5

Views: 1290

Answers (2)

Ryabchenko Alexander

Reputation: 12390

Search for links with type="application/rss+xml", like:

<link href="/feeds" rel="alternate" title="RSS feed" type="application/rss+xml">

<link rel="alternate" type="application/rss+xml" title="DER SPIEGEL | RSS Schlagzeilen" href="https://www.spiegel.de/schlagzeilen/index.rss">

<link rel="alternate" type="application/rss+xml" title="DER SPIEGEL | RSS Nachrichten" href="https://www.spiegel.de/index.rss">
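Pulling those tags out could look roughly like this (the HTML string is a shortened stand-in for a real page):

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

html = '''<head>
<link href="/feeds" rel="alternate" title="RSS feed" type="application/rss+xml">
<link rel="alternate" type="application/rss+xml" title="DER SPIEGEL | RSS Schlagzeilen" href="https://www.spiegel.de/schlagzeilen/index.rss">
</head>'''

soup = BeautifulSoup(html, "html.parser")
# type="application/rss+xml" marks the autodiscovery links;
# urljoin turns relative hrefs like /feeds into absolute URLs
feeds = [urljoin("https://www.spiegel.de/", tag["href"])
         for tag in soup.find_all("link", type="application/rss+xml")]
print(feeds)
```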

Upvotes: 0

HedgeHog

Reputation: 25073

Don't reinvent the wheel: there are many curated RSS directories and collections that can serve you well and give you a nice starting point.

However, to follow your approach, you should first collect all the links on the page that could point to an RSS feed:

soup.select('a[href*="rss"],a[href*="/feed"],a:-soup-contains-own("RSS")')

and then verify whether each one is an actual feed or just a collection page:

soup.select('[type="application/rss+xml"],a[href*=".rss"]')

or by checking the content type:

if 'xml' in requests.get(rss).headers.get('content-type'):

Note: This is just to point you in a direction, because there are a lot of patterns used to mark such feeds (rss, feed, feed/, news, xml, ...), and the content type is also reported differently by different servers.
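The content-type check can be made a bit more defensive, since `headers.get()` may return `None` and servers label feeds inconsistently. A small sketch (the accepted type substrings are an assumption, not a complete list):

```python
def looks_like_feed(content_type):
    # Guard against a missing Content-Type header (headers.get() -> None)
    if not content_type:
        return False
    content_type = content_type.lower()
    # Servers report feeds as e.g. application/rss+xml, text/xml, application/atom+xml
    return any(t in content_type for t in ("xml", "rss", "atom"))

print(looks_like_feed("application/rss+xml; charset=utf-8"))  # True
print(looks_like_feed(None))                                  # False
print(looks_like_feed("text/html"))                           # False
```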

Example

import requests, re
from bs4 import BeautifulSoup

website_links = ["https://www.diepresse.com/", 
"https://www.sueddeutsche.de/", 
"https://www.berliner-zeitung.de/", 
"https://www.aargauerzeitung.ch/", 
"https://www.luzernerzeitung.ch/", 
"https://www.nzz.ch/technologie/",
"https://www.spiegel.de/", 
"https://www.blick.ch/wirtschaft/"]

rss_feeds = []

def check_for_real_rss(url):
    # Extract scheme + domain so relative hrefs can be resolved
    base_url = re.search(r'^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/\n]+)', url).group(0)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    for e in soup.select('[type="application/rss+xml"],a[href*=".rss"],a[href$="feed"]'):
        href = e.get('href')
        if not href:
            continue
        rss = base_url + href if href.startswith('/') else href
        # Only keep URLs whose response actually looks like XML
        if 'xml' in (requests.get(rss).headers.get('content-type') or ''):
            rss_feeds.append(rss)

for url in website_links:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for e in soup.select('a[href*="rss"],a[href*="/feed"],a:-soup-contains-own("RSS")'):
        href = e.get('href')
        if not href:
            continue
        check_for_real_rss(url.strip('/') + href if href.startswith('/') else href)

print(set(rss_feeds))

Output

{'https://rss.sueddeutsche.de/app/service/rss/alles/index.rss?output=rss','https://rss.sueddeutsche.de/rss/Topthemen',
 'https://www.aargauerzeitung.ch/aargau/aarau.rss',
 'https://www.aargauerzeitung.ch/aargau/baden.rss',
 'https://www.aargauerzeitung.ch/leben-kultur.rss',
 'https://www.aargauerzeitung.ch/schweiz-welt.rss',
 'https://www.aargauerzeitung.ch/sport.rss',
 'https://www.bzbasel.ch/basel.rss',
 'https://www.grenchnertagblatt.ch/solothurn/grenchen.rss',
 'https://www.jetzt.de/alle_artikel.rss',
 'https://www.limmattalerzeitung.ch/limmattal.rss',
 'https://www.luzernerzeitung.ch/international.rss',
 'https://www.luzernerzeitung.ch/kultur.rss',
 'https://www.luzernerzeitung.ch/leben.rss',
 'https://www.luzernerzeitung.ch/leben/ratgeber.rss',...}

Upvotes: 8
