Removing html tags using python?

Question

I know there have probably been a million questions on this, but I'm wondering how to remove these tags without having to import or use HTMLParser or regex. I've tried a bunch of different replace statements to try and remove parts of the strings enclosed by < >'s, to no avail.

Basically what I'm working with is:

response = urlopen(url)
html = response.read()
html = html.decode()

From here I am just trying to manipulate the string variable html to do the above. Is there any way to do it as i specified, or must you use previous methods I have seen?

I also tried to make a for loop that went through every character to check if it was enclosed, but for some reason it wouldn't give me a proper print out, that was:

for i in html:
    if i == '<':
        html.replace(i, '')
        delete = True
    if i == '>':
        html.replace(i, '')
        delete = False
    if delete == True:
        html.replace(i, '')

Would appreciate any input.

ndpu · Accepted Answer

str.replace returns a copy of the string with all occurrences of substring replaced by new, you cant use it like you do and you shouldnt modify string on which your loop is iterating anyway. Using of extra list is one of the ways you can go:

txt = []
for i in html:
    if i == '<':
        delete = True
        continue
    if i == '>':
        delete = False
        continue
    if delete == True:
        continue

    txt.append(i)

now txt list contains result text, you can join it:

print ''.join(txt)

Demo:

html = 'some
text'
#...
>>> txt
['s', 'o', 'm', 'e', 't', 'e', 'x', 't']
>>> ''.join(txt)
'sometext'

Removing html tags using python?

Answers (1)

Related Questions