Reputation: 196
I am trying to build an RSS parser that checks each title for keywords, so that I only get the feed items I am interested in. So far I can extract the titles with a regex, but I am unsure how to proceed. I would like to check the titles against multiple keywords, so it would be best to load them from a .txt file. Only titles with a positive match should be printed. Can someone point me in the right direction?
My code so far:
# -*- coding: utf-8 -*-
import re
import urllib2
from cookielib import CookieJar

# Opener that keeps cookies and sends a browser-like User-Agent
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

def main():
    try:
        page = 'http://randomdomainXYZ.com/news-feed.xml'
        sourceCode = opener.open(page).read()
        #print sourceCode
        # Grab the text between every pair of <title> tags
        titles = re.findall(r'<title>(.*?)</title>', sourceCode)
        for title in titles:
            print title
    except Exception, e:
        print str(e)

main()
Upvotes: 1
Views: 464
Reputation: 2318
So you want to print only the titles that contain at least one of the words in some list. Try:
for title in titles:
    if any(word in title for word in word_list):
        print title
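Note that `word in title` is a case-sensitive substring test, so the keyword 'art' would also match 'Start', and 'python' would not match 'Python'. If that matters to you, a minimal sketch of a case-insensitive, whole-word variant could look like this. The function name `matches_any` and the sample data are made up for illustration, and it assumes each keyword starts and ends with a letter or digit:

```python
import re

def matches_any(title, word_list):
    """Return True if any keyword occurs in title as a whole word, ignoring case."""
    for word in word_list:
        # re.escape keeps regex metacharacters in a keyword literal;
        # \b only matches at a word boundary, so 'art' does not match 'Start'
        pattern = r'\b' + re.escape(word) + r'\b'
        if re.search(pattern, title, re.IGNORECASE):
            return True
    return False

# Hypothetical sample data
titles = ['Python 3 released', 'Start of the season', 'Modern Art auction']
word_list = ['python', 'art']
matched = [t for t in titles if matches_any(t, word_list)]
for title in matched:
    print(title)
```

You would then use `if matches_any(title, word_list):` in place of the `any(...)` check in your loop.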
As for reading your word list, you can read all the lines in a file with:
with open('word_list.txt') as f:
    word_list = f.readlines()
# Make sure words don't end with a newline character ('\n')
word_list = [word.strip() for word in word_list]
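One thing to watch out for: a blank line in the file becomes an empty string after `strip()`, and `'' in title` is always true, so every title would match. A minimal sketch that puts both pieces together and skips blank lines might look like this. The function names `load_keywords` and `filter_titles`, the temporary file, and the sample titles are placeholders for illustration, not part of your code:

```python
import os
import tempfile

def load_keywords(path):
    """Read one keyword per line, dropping surrounding whitespace and blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def filter_titles(titles, word_list):
    """Keep only the titles that contain at least one keyword (substring match)."""
    return [title for title in titles
            if any(word in title for word in word_list)]

# Demo with made-up data: write a small keyword file, then filter some titles
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Python\n\nLinux\n')  # note the blank line in the middle
keywords = load_keywords(tmp.name)
os.remove(tmp.name)

sample_titles = ['Python 3 released', 'Weather today', 'Linux kernel update']
for title in filter_titles(sample_titles, keywords):
    print(title)
```

In your script you would call `load_keywords('word_list.txt')` once at startup and pass the `titles` list from `re.findall` to `filter_titles`.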
Upvotes: 1