Python regex to find and regex to remove from list

Question

I built this little RSS reader for myself a while ago, and I felt inspired to update it to exclude junk from the description tags. I'm testing it now to remove &lt; (all content) &gt; from the description tags, and I'm having trouble getting this right.

So far my code looks something like this:

from re import findall
from Tkinter import *
from urllib import urlopen

disc = []
URL = 'http://feeds.sciencedaily.com/sciencedaily/matter_energy/engineering?format=xml'
O_W = urlopen(URL).read()

disc_ex = findall('<description>(.*)</description>',O_W)
for i in disc_ex:
    new_disc = i.replace(findall('<(.*)>',i),'')
    disc.extend([new_disc])

Before I added the new_disc line to try to remove the rubbish text, my text would normally come through looking like this:

"Tailored DNA structures could find targeted cells and release their molecular payload selectively into the cells.<img src="http://feeds.feedburner.com/~r/sciencedaily/matter_energy/engineering/~4/J1bTggGxFOY" height="1" width="1" alt=""/>"

What I want is just the text without the rubbish, so essentially just:

"Tailored DNA structures could find targeted cells and release their molecular payload selectively into the cells."

Any suggestions for me?

mmachine · Accepted Answer

There are several solutions, BeautifulSoup for example. To follow your own idea and drop everything inside '<' ... '>' brackets, just use re.sub in the loop:

...
import re

for i in disc_ex:
    # re.sub deletes every <...> tag; [^>]* stops each match at the first '>'
    disc.append(re.sub(r'<[^>]*>', '', i))
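For reference, here is a minimal self-contained sketch of the same idea, written for Python 3 rather than the Python 2 used in the question. The helper name strip_tags and the sample string are illustrative assumptions, not part of the original code. It also assumes the feed may deliver the markup entity-escaped (as &lt;img ...&gt;), which is why it decodes entities before stripping.

```python
import re
from html import unescape

# Matches one tag: '<', then anything that is not '>', then '>'.
TAG_RE = re.compile(r'<[^>]*>')


def strip_tags(text):
    """Hypothetical helper: decode entities, then remove every <...> tag."""
    # Feeds often escape markup as &lt;img ...&gt;, so decode it first.
    decoded = unescape(text)
    # Replace each tag with an empty string and trim stray whitespace.
    return TAG_RE.sub('', decoded).strip()


# Sample modelled on the description text shown in the question.
sample = (
    'Tailored DNA structures could find targeted cells and release '
    'their molecular payload selectively into the cells.'
    '&lt;img src="http://feeds.feedburner.com/~r/sciencedaily/'
    'matter_energy/engineering/~4/J1bTggGxFOY" '
    'height="1" width="1" alt=""/&gt;'
)

disc = [strip_tags(s) for s in [sample]]
print(disc[0])
```

One caveat with this sketch: because entities are decoded before stripping, a description that legitimately contains escaped comparison signs (for example "a &lt; b and c &gt; d") would lose the text between them. For anything beyond simple cleanup, a real HTML parser such as BeautifulSoup is the safer choice.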
