eLudium

Reputation: 196

Check RSS feed titles for keywords and print only the titles that contain one

I am trying to build an RSS parser that checks each title for keywords, so that I only get the feed items I am interested in. So far I am able to extract the titles using a regex, but I am unsure how to proceed. I would like to check the titles against multiple keywords, so it would be best to load them from a .txt file. Only titles with a positive match should be printed. Can someone point me in the right direction?

My code so far:

# -*- coding: utf-8 -*-
import re
import urllib2
from cookielib import CookieJar

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

def main():
    try:
        page = 'http://randomdomainXYZ.com/news-feed.xml'
        sourceCode = opener.open(page).read()
        #print sourceCode

        try:
            titles = re.findall(r'<title>(.*?)</title>', sourceCode)
            for title in titles:
                print title

        except Exception, e:
            print str(e)

    except Exception, e:
        print str(e)

main()

Upvotes: 1

Views: 464

Answers (1)

Zachary Cross

Reputation: 2318

So, you want to print titles that contain one of the words in some list. Try:

for title in titles:
    if any(word in title for word in word_list):
        print title
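Keep in mind that `word in title` is a case-sensitive substring test, so the keyword "art" would also match a title containing "party", and "python" would not match "Python". If that matters for your feed, a rough sketch of whole-word, case-insensitive matching could look like the following. The function name `title_matches` is made up for illustration, and the code is written so that it should run unchanged on Python 2 and Python 3.

```python
import re

def title_matches(title, word_list):
    """Return True if any keyword occurs in title as a whole word, ignoring case."""
    for word in word_list:
        # re.escape keeps characters such as '+' or '.' in a keyword literal.
        # The lookarounds require that no word character touches the keyword,
        # so "art" does not match inside "party".
        pattern = r'(?<!\w)' + re.escape(word) + r'(?!\w)'
        if re.search(pattern, title, re.IGNORECASE):
            return True
    return False
```

You would then use `if title_matches(title, word_list):` in place of the `any(...)` test in the loop.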

As for reading your word list, you can read all the lines in a file with:

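with open('word_list.txt') as f:
    word_list = f.readlines()

# Strip the trailing newline ('\n') from each word and skip blank lines;
# an empty string is a substring of every title and would match everything
word_list = [word.strip() for word in word_list if word.strip()]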
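If you prefer to keep this out of `main()`, the two steps can be wrapped in small helpers. This is only a sketch: the names `load_keywords` and `filter_titles` are my own, the file is assumed to hold one keyword per line, and lowercasing both sides is an extra I added so that "python" also matches "Python".

```python
def load_keywords(path):
    """Read one keyword per line, dropping surrounding whitespace and blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def filter_titles(titles, keywords):
    """Return the titles that contain at least one keyword (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    return [title for title in titles
            if any(keyword in title.lower() for keyword in lowered)]
```

In your `main()` you would then loop over `filter_titles(titles, load_keywords('word_list.txt'))` instead of `titles`.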

Upvotes: 1
