Reputation: 81
I built this little RSS reader for myself a while ago, and I felt inspired to update it to exclude junk from the description tags. I'm testing it now, trying to remove &lt;(all content)&gt; from the description text, and I'm having trouble getting it right.
So far my code looks something like this:
from re import findall
from Tkinter import *
from urllib import urlopen
disc = []
URL = 'http://feeds.sciencedaily.com/sciencedaily/matter_energy/engineering?format=xml'
O_W = urlopen(URL).read()
disc_ex = findall('<description>(.*)</description>',O_W)
for i in disc_ex:
    new_disc = i.replace(findall('<(.*)>',i),'')
    disc.extend([new_disc])
Before I added the new_disc line in an attempt to strip the rubbish text, my text would normally come through looking like this:
"Tailored DNA structures could find targeted cells and release their molecular payload selectively into the cells.<img src="http://feeds.feedburner.com/~r/sciencedaily/matter_energy/engineering/~4/J1bTggGxFOY" height="1" width="1" alt=""/>"
What I want is just the text without the rubbish, so essentially just:
"Tailored DNA structures could find targeted cells and release their molecular payload selectively into the cells."
Any suggestions for me?
Upvotes: 1
Views: 98
Reputation: 926
There are several solutions, BeautifulSoup for example. To follow your idea and drop everything inside '<' ... '>' brackets, just change the loop:
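import re
...
for i in disc_ex:
    # str.replace() expects a string, not the list that findall() returns,
    # so substitute the tags away with re.sub() instead.
    # '<[^>]*>' matches a single tag at a time, so text between two tags is kept.
    disc.append(re.sub(r'<[^>]*>', '', i))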
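For reference, here is a minimal, self-contained sketch of the regex idea, run against the sample string from the question rather than the live feed. The names strip_tags and TAG_RE are made up for this illustration, and it assumes the descriptions contain literal <...> tags as in that sample. If the feed delivers them entity-escaped instead, they would need unescaping first.

```python
import re

# Matches one tag: '<', then any run of characters that are not '>', then '>'.
# (Hypothetical name, introduced only for this sketch.)
TAG_RE = re.compile(r'<[^>]*>')


def strip_tags(text):
    """Return text with every <...> tag removed."""
    return TAG_RE.sub('', text)


# Sample description copied from the question; in the real script this
# would be each item of disc_ex.
sample = ('Tailored DNA structures could find targeted cells and release '
          'their molecular payload selectively into the cells.'
          '<img src="http://feeds.feedburner.com/~r/sciencedaily/'
          'matter_energy/engineering/~4/J1bTggGxFOY" '
          'height="1" width="1" alt=""/>')

disc = [strip_tags(item) for item in [sample]]
print(disc[0])
```

A regex like this is fine for simple, well-formed tags such as the tracking image above. For messier markup, a real parser such as BeautifulSoup is usually the safer choice.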
Upvotes: 1