blackmamba

Reputation: 1992

Find rss links in a webpage using regex

I am trying to find RSS links on a website, but my code also returns img src and CSS links, because their URLs contain the word rss.

This is my code:

import urllib2
import re

website = urllib2.urlopen("http://www.apple.com/rss")
html = website.read()
links = re.findall('"((http)s?://.*rss.*)"',html)
for link in links:
    print link

Upvotes: 0

Views: 705

Answers (2)

Sabuj Hassan

Reputation: 39375

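import re

# Remove everything above the container div.
# re.DOTALL lets "." match newlines, so the pattern can span several lines.
html = re.sub(r'.*?<div id="container">', "", html, count=1, flags=re.DOTALL)

# Remove everything from the callout div down to the end of the page.
html = re.sub(r'<div class="callout">.*', "", html, flags=re.DOTALL)

# Then match the links inside list items.
links = re.findall(r'<li[^>]*>\s*<a href="(https?://[^"]*)"', html, re.IGNORECASE)
# You can push the text rss inside the pattern if you want.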
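
One way to push the text rss inside the pattern is to require it within the quoted URL while keeping `[^"]*`, so the match cannot run past the closing quote. The following is a minimal sketch in Python 3; the `sample_html` fragment and its URLs are made up for illustration and are not the actual markup of the page in the question.

```python
import re

# Hypothetical fragment for illustration only; not the real page markup.
sample_html = '''
<link rel="stylesheet" href="http://example.com/rss/styles.css">
<img src="http://example.com/rss/images/icon.png">
<ul>
  <li><a href="http://example.com/feeds/news.rss">News</a></li>
  <li class="hot"> <a href="https://example.com/rss/hot.xml">Hot</a></li>
  <li><a href="http://example.com/about.html">About</a></li>
</ul>
'''

# [^"]* cannot cross a double quote, so each match stays inside one href value.
# Requiring "rss" in the URL drops ordinary links such as about.html.
RSS_LINK = re.compile(
    r'<li[^>]*>\s*<a href="(https?://[^"]*rss[^"]*)"',
    re.IGNORECASE,
)

rss_links = RSS_LINK.findall(sample_html)
for link in rss_links:
    print(link)
```

Because the pattern is anchored on `<li>` followed by `<a href=`, the stylesheet and image URLs are skipped even though they contain rss.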

Upvotes: 1

ChrisP

Reputation: 5942

I don't recommend parsing HTML with a regular expression. There are better tools for finding links on web pages. My favorite is lxml.

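import lxml.html

# html is the page source fetched in the question.
root = lxml.html.fromstring(html)
# iterlinks() yields (element, attribute, link, pos) tuples.
links = root.iterlinks()
next(links)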

The above will allow you to iterate over each link. You then need to infer whether the link refers to an RSS feed. Here are some ways you might do this...

  • Look for keywords related to RSS in the URL
  • Make a request and check the response's content type (application/rss+xml)

Without actually checking the server response, you won't know whether something is RSS. A URL like http://www.example.com/f might be an RSS feed. You can't know for sure until you check.
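
The two checks listed above might be sketched roughly as follows, in Python 3 with only the standard library. The function names, the keyword list, and the inclusion of `application/atom+xml` are my own assumptions for illustration; adjust them for the sites you scan. Some servers also label feeds with a generic XML type, which this sketch does not accept.

```python
import urllib.request
from urllib.parse import urlparse

# Assumed keyword list; extend it to suit the site being scanned.
FEED_KEYWORDS = ("rss", "feed", "atom")
# Assumed set of feed media types; servers are not always consistent.
FEED_TYPES = ("application/rss+xml", "application/atom+xml")


def url_suggests_feed(url):
    """Cheap heuristic: does the URL path or query mention a feed keyword?"""
    parts = urlparse(url)
    haystack = (parts.path + "?" + parts.query).lower()
    return any(keyword in haystack for keyword in FEED_KEYWORDS)


def content_type_is_feed(header_value):
    """Compare a Content-Type header value against the assumed feed types."""
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type in FEED_TYPES


def response_is_feed(url, timeout=10):
    """Request the URL and inspect the Content-Type the server reports."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return content_type_is_feed(response.headers.get("Content-Type", ""))
```

The keyword check is fast but gives false positives (a stylesheet under /rss/ passes) and false negatives (http://www.example.com/f fails), so it works best as a filter to pick candidates for `response_is_feed`, which makes the actual request.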

Upvotes: 0
