Reputation: 1992
I am trying to find RSS links on a website, but my code also returns img src and CSS links because their URLs contain the word rss.
This is my code:
import urllib2
import re
website = urllib2.urlopen("http://www.apple.com/rss")
html = website.read()
links = re.findall('"((http)s?://.*rss.*)"',html)
for link in links:
    print link
Upvotes: 0
Views: 705
Reputation: 39375
## remove everything above the container div ((?s) lets . match newlines)
html = re.sub(r'(?s).*?<div id="container">', "", html, count=1)
## remove everything from the callout div down
html = re.sub(r'(?s)<div class="callout">.*', "", html)
## then match the links inside list items
links = re.findall(r'<li[^>]*>\s*<a href="(https?://[^"]*)"', html, re.IGNORECASE)
## you can add the text rss to the pattern if you want
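As a rough sketch of putting rss inside the pattern: this assumes the feed links are plain anchor tags with double-quoted absolute URLs, and find_rss_links is just a hypothetical helper name. Requiring `<a ... href=` is what keeps img src and stylesheet links out of the result.

```python
import re

def find_rss_links(html):
    # Match only <a ... href="..."> tags whose URL contains "rss".
    # [^>]* stays inside the tag and [^"]* stays inside the attribute value,
    # so the match cannot run on into neighbouring markup.
    pattern = r'<a\s[^>]*href="(https?://[^"]*rss[^"]*)"'
    return re.findall(pattern, html, re.IGNORECASE)
```

With a single capture group, re.findall returns a flat list of URL strings rather than tuples.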
Upvotes: 1
Reputation: 5942
I don't recommend parsing HTML with a regular expression. There are better tools for finding links on web pages. My favorite is lxml.
import lxml.html
root = lxml.html.fromstring(html)
# iterlinks() yields (element, attribute, link, pos) tuples
links = root.iterlinks()
next(links)
The above will allow you to iterate over each link. You then need to infer whether the link refers to an RSS feed. One way is to look for the RSS MIME type (application/rss+xml).
Without actually checking the server response, you won't know whether something is RSS. A URL like http://www.example.com/f might be an RSS feed. You can't know for sure until you check.
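Here is a minimal sketch of both checks, written for Python 3 and using the standard library's html.parser rather than lxml so that it runs without extra packages. FeedLinkFinder, is_feed_content_type, url_serves_feed and the FEED_TYPES list are names and assumptions of mine, not part of any library.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# MIME types that usually indicate a feed (an assumed, non-exhaustive list).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link>/<a> tags whose type attribute names a feed."""

    def __init__(self):
        super().__init__()
        self.feed_links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Attribute values can be None (valueless attributes), hence the "or".
        declared_type = (attributes.get("type") or "").strip().lower()
        if tag in ("link", "a") and declared_type in FEED_TYPES and attributes.get("href"):
            self.feed_links.append(attributes["href"])


def is_feed_content_type(header_value):
    """Return True if a Content-Type header value names a feed MIME type."""
    if not header_value:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = header_value.split(";", 1)[0].strip().lower()
    return mime in FEED_TYPES


def url_serves_feed(url, timeout=10):
    """Fetch the URL and inspect the Content-Type the server reports (needs network)."""
    request = Request(url, headers={"User-Agent": "feed-check-sketch"})
    with urlopen(request, timeout=timeout) as response:
        return is_feed_content_type(response.headers.get("Content-Type"))
```

To use it, create a FeedLinkFinder, call its feed() method with the page source, and read feed_links. Some servers report feeds as text/xml or application/xml, so a stricter check would also have to look at the response body.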
Upvotes: 0