GenGen

Reputation: 81

Python regex to find and regex to remove from list

I built this little RSS reader for myself a while ago, and I felt inspired to update it to exclude junk from the description tags. I'm testing it now to remove &lt; (all content) &gt; from the description tags, and I'm having trouble getting this right.

So far my code looks something like this:

from re import findall
from Tkinter import *
from urllib import urlopen

disc = []
URL = 'http://feeds.sciencedaily.com/sciencedaily/matter_energy/engineering?format=xml'
O_W = urlopen(URL).read()

disc_ex = findall('<description>(.*)</description>',O_W)
for i in disc_ex:
    new_disc = i.replace(findall('&lt;(.*)&gt;',i),'')
    disc.extend([new_disc])

Before I added the new_disc line in an attempt to remove the rubbish text, my text would normally come through looking like this:

"Tailored DNA structures could find targeted cells and release their molecular payload selectively into the cells.&lt;img src="http://feeds.feedburner.com/~r/sciencedaily/matter_energy/engineering/~4/J1bTggGxFOY" height="1" width="1" alt=""/&gt;"

What I want is just the text without the rubbish, so essentially just:

"Tailored DNA structures could find targeted cells and release their molecular payload selectively into the cells."

Any suggestions for me?

Upvotes: 1

Views: 98

Answers (1)

mmachine

Reputation: 926

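There are several solutions, BeautifulSoup for example. To follow your idea and drop the strings within the escaped '&lt;' ... '&gt;' brackets, import re and replace the loop body with a re.sub call. Note that str.replace expects a string, while findall returns a list, so the original replace line raises a TypeError:

import re
...
for i in disc_ex:
    # Non-greedy match, so each escaped tag (&lt;...&gt;) is removed on its own
    new_disc = re.sub(r'&lt;.*?&gt;', '', i)
    disc.append(new_disc)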
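
To check the regex approach in isolation, here is a minimal sketch run against the sample description from the question. The names strip_escaped_tags and ESCAPED_TAG are my own and not part of the original script, and the sketch assumes the junk always arrives as escaped markup of the form &lt;...&gt;.

```python
import re

# Sample description text copied from the question.
sample = ('Tailored DNA structures could find targeted cells and release '
          'their molecular payload selectively into the cells.'
          '&lt;img src="http://feeds.feedburner.com/~r/sciencedaily/'
          'matter_energy/engineering/~4/J1bTggGxFOY" height="1" width="1" '
          'alt=""/&gt;')

# Hypothetical pattern name. The non-greedy .*? stops at the first &gt;,
# so text sitting between two escaped tags is kept.
ESCAPED_TAG = re.compile(r'&lt;.*?&gt;', re.DOTALL)


def strip_escaped_tags(text):
    """Remove every &lt;...&gt; span and trim surrounding whitespace."""
    return ESCAPED_TAG.sub('', text).strip()


cleaned = strip_escaped_tags(sample)
print(cleaned)
```

A greedy pattern such as '&lt;(.*)&gt;' would instead swallow everything from the first &lt; to the last &gt;, including any real text in between, which is why the non-greedy form is probably the safer choice here.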

Upvotes: 1
