taga

Reputation: 3885

How to extract RSS links from website with Python

I am trying to extract all RSS feed links from some websites (if RSS exists at all). Below are some website links that have RSS, followed by a list of RSS links from those websites.

website_links = ["https://www.diepresse.com/", 
"https://www.sueddeutsche.de/", 
"https://www.berliner-zeitung.de/", 
"https://www.aargauerzeitung.ch/", 
"https://www.luzernerzeitung.ch/", 
"https://www.nzz.ch/",
"https://www.spiegel.de/", 
"https://www.blick.ch/",
"https://www.berliner-zeitung.de/", 
"https://www.ostsee-zeitung.de/", 
"https://www.kleinezeitung.at/", 
"https://www.blick.ch/", 
"https://www.ksta.de/", 
"https://www.tagblatt.ch/", 
"https://www.srf.ch/", 
"https://www.derstandard.at/"]


website_rss_links = ["https://www.diepresse.com/rss/Kunst", 
"https://rss.sueddeutsche.de/rss/Kultur", 
"https://www.berliner-zeitung.de/feed.id_kultur-kunst.xml", 
"https://www.aargauerzeitung.ch/leben-kultur.rss", 
"https://www.luzernerzeitung.ch/kultur.rss", 
"https://www.nzz.ch/technologie.rss", 
"https://www.spiegel.de/kultur/literatur/index.rss", 
"https://www.luzernerzeitung.ch/wirtschaft.rss", 
"https://www.blick.ch/wirtschaft/rss.xml", 
"https://www.berliner-zeitung.de/feed.id_abgeordnetenhauswahl.xml", 
"https://www.ostsee-zeitung.de/arc/outboundfeeds/rss/category/wissen/", 
"https://www.kleinezeitung.at/rss/politik", 
"https://www.blick.ch/wirtschaft/rss.xml", 
"https://feed.ksta.de/feed/rss/politik/index.rss", 
"https://www.tagblatt.ch/wirtschaft.rss", 
"https://www.srf.ch/news/bnf/rss/1926", 
"https://www.derstandard.at/rss/wirtschaft"]

My approach is to extract all links and then check whether some of them have "rss" in them, but that is just a first step:

import requests
from bs4 import BeautifulSoup

for url in website_links:
    response = requests.get(url)
    print(response)
    soup = BeautifulSoup(response.content, 'html.parser')
    list_of_links = [link["href"] for link in soup.select("a[href]")]
    print("Number of links:", len(list_of_links))

    for link in list_of_links:
        if "rss" in link:
            print(url)
            print(link)
    print()
    

I have heard that I can look for RSS links like this, but I do not know how to incorporate it into my code:

type=application/rss+xml
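I imagine the check could look something like this (an untested sketch: the HTML string and `example.com` URL are placeholders standing in for a fetched page):

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Placeholder HTML standing in for response.content from a real site
html = """<html><head>
<link rel="alternate" type="application/rss+xml" title="Feed" href="/index.rss">
</head><body></body></html>"""

page_url = "https://www.example.com/"  # placeholder for the site being crawled
soup = BeautifulSoup(html, "html.parser")
# <link type="application/rss+xml"> is the feed autodiscovery convention;
# urljoin resolves relative hrefs against the page URL
feed_links = [urljoin(page_url, tag["href"])
              for tag in soup.select('link[type="application/rss+xml"]')
              if tag.has_attr("href")]
print(feed_links)
```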

My goal is to get working RSS URLs in the end. Maybe the issue is that I am only sending a request to the front page, and perhaps I should crawl other pages in order to extract all RSS links, but I hope there is a faster/better way to do the extraction.

You can see that the RSS links contain or end with (for example):

.rss
/rss
/rss/
rss.xml
/feed/
rss-feed

etc.
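A rough pattern for these endings could be sketched like this (a heuristic only, not exhaustive, checked against a few of the URLs above):

```python
import re

# Heuristic: matches the common feed-URL endings listed above
FEED_PATTERN = re.compile(r"(\.rss$|/rss/?$|rss\.xml$|/feed/?$|rss-feed)", re.IGNORECASE)

candidates = [
    "https://www.luzernerzeitung.ch/kultur.rss",
    "https://www.kleinezeitung.at/rss/politik",  # 'rss' in path, but no listed ending
    "https://www.blick.ch/wirtschaft/rss.xml",
]
matches = [u for u in candidates if FEED_PATTERN.search(u)]
print(matches)
```

Note that it misses feeds like `/rss/politik` where "rss" sits mid-path, which is why matching on endings alone is only a first filter.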

Upvotes: 5

Views: 1290

Answers (2)

Ryabchenko Alexander

Reputation: 12390

Search for links with type="application/rss+xml", like:

<link href="/feeds" rel="alternate" title="RSS feed" type="application/rss+xml">

<link rel="alternate" type="application/rss+xml" title="DER SPIEGEL | RSS Schlagzeilen" href="https://www.spiegel.de/schlagzeilen/index.rss">

<link rel="alternate" type="application/rss+xml" title="DER SPIEGEL | RSS Nachrichten" href="https://www.spiegel.de/index.rss">
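Pulling those tags out could look roughly like this (the HTML string is a shortened stand-in for a real page):

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

html = '''<head>
<link href="/feeds" rel="alternate" title="RSS feed" type="application/rss+xml">
<link rel="alternate" type="application/rss+xml" title="DER SPIEGEL | RSS Schlagzeilen" href="https://www.spiegel.de/schlagzeilen/index.rss">
</head>'''

soup = BeautifulSoup(html, "html.parser")
# type="application/rss+xml" marks the autodiscovery links;
# urljoin turns relative hrefs like /feeds into absolute URLs
feeds = [urljoin("https://www.spiegel.de/", tag["href"])
         for tag in soup.find_all("link", type="application/rss+xml")]
print(feeds)
```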

Upvotes: 0

HedgeHog

Reputation: 25073

Don't reinvent the wheel: there are many curated RSS directories and collections that can serve you well and give you a nice starting point.

However, to follow your approach, you should first collect all the links on the page that could point to an RSS feed:

soup.select('a[href*="rss"],a[href*="/feed"],a:-soup-contains-own("RSS")')

and then verify whether each one is an actual feed or just a collection page:

soup.select('[type="application/rss+xml"],a[href*=".rss"]')

or by checking the content type:

if 'xml' in requests.get(rss).headers.get('content-type'):

Note: This is just to point you in a direction, because there are a lot of patterns used to mark such feeds (rss, feed, feed/, news, xml, ...), and the content type is also reported differently by different servers.
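The content-type check can be made a bit more defensive, since `headers.get()` may return `None` and servers label feeds inconsistently. A small sketch (the accepted type substrings are an assumption, not a complete list):

```python
def looks_like_feed(content_type):
    # Guard against a missing Content-Type header (headers.get() -> None)
    if not content_type:
        return False
    content_type = content_type.lower()
    # Servers report feeds as e.g. application/rss+xml, text/xml, application/atom+xml
    return any(t in content_type for t in ("xml", "rss", "atom"))

print(looks_like_feed("application/rss+xml; charset=utf-8"))  # True
print(looks_like_feed(None))                                  # False
print(looks_like_feed("text/html"))                           # False
```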

Example

import requests, re
from bs4 import BeautifulSoup

website_links = ["https://www.diepresse.com/", 
"https://www.sueddeutsche.de/", 
"https://www.berliner-zeitung.de/", 
"https://www.aargauerzeitung.ch/", 
"https://www.luzernerzeitung.ch/", 
"https://www.nzz.ch/technologie/",
"https://www.spiegel.de/", 
"https://www.blick.ch/wirtschaft/"]

rss_feeds = []

def check_for_real_rss(url):
    # Extract scheme + domain so relative hrefs can be resolved
    base_url = re.search(r'^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/\n]+)', url).group(0)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    for e in soup.select('[type="application/rss+xml"],a[href*=".rss"],a[href$="feed"]'):
        href = e.get('href')
        if not href:
            continue
        rss = base_url + href if href.startswith('/') else href
        # Only keep URLs whose response actually looks like XML
        if 'xml' in (requests.get(rss).headers.get('content-type') or ''):
            rss_feeds.append(rss)

for url in website_links:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for e in soup.select('a[href*="rss"],a[href*="/feed"],a:-soup-contains-own("RSS")'):
        href = e.get('href')
        if not href:
            continue
        check_for_real_rss(url.strip('/') + href if href.startswith('/') else href)

print(set(rss_feeds))

Output

{'https://rss.sueddeutsche.de/app/service/rss/alles/index.rss?output=rss','https://rss.sueddeutsche.de/rss/Topthemen',
 'https://www.aargauerzeitung.ch/aargau/aarau.rss',
 'https://www.aargauerzeitung.ch/aargau/baden.rss',
 'https://www.aargauerzeitung.ch/leben-kultur.rss',
 'https://www.aargauerzeitung.ch/schweiz-welt.rss',
 'https://www.aargauerzeitung.ch/sport.rss',
 'https://www.bzbasel.ch/basel.rss',
 'https://www.grenchnertagblatt.ch/solothurn/grenchen.rss',
 'https://www.jetzt.de/alle_artikel.rss',
 'https://www.limmattalerzeitung.ch/limmattal.rss',
 'https://www.luzernerzeitung.ch/international.rss',
 'https://www.luzernerzeitung.ch/kultur.rss',
 'https://www.luzernerzeitung.ch/leben.rss',
 'https://www.luzernerzeitung.ch/leben/ratgeber.rss',...}

Upvotes: 8
