Reputation: 125
I know there have probably been a million questions on this, but I'm wondering how to remove HTML tags from a string without importing HTMLParser or using regex. I've tried a bunch of different replace statements to remove the parts of the string enclosed in < and >, to no avail.
Basically what I'm working with is:
response = urlopen(url)
html = response.read()
html = html.decode()
From here I am just trying to manipulate the string variable html to do the above. Is there any way to do it as I specified, or do I have to use the methods I have already seen?
I also tried a for loop that goes through every character to check whether it is enclosed in a tag, but for some reason it doesn't give me the proper output:
for i in html:
    if i == '<':
        html.replace(i, '')
        delete = True
    if i == '>':
        html.replace(i, '')
        delete = False
    if delete == True:
        html.replace(i, '')
Would appreciate any input.
Upvotes: 2
Views: 644
Reputation: 22561
str.replace returns a copy of the string with all occurrences of the substring replaced by the new one. It does not change the original string, so calling it and discarding the result, as your loop does, has no effect. You shouldn't modify the string your loop is iterating over anyway. Using an extra list is one way to go:
txt = []
delete = False  # True while we are inside a tag
for i in html:
    if i == '<':
        delete = True
        continue
    if i == '>':
        delete = False
        continue
    if delete:
        continue
    txt.append(i)  # keep only characters outside tags
Now the txt list contains the result text, and you can join it:
print(''.join(txt))
Demo:
html = '<body><div>some</div><div>text</div></body>'
#...
>>> txt
['s', 'o', 'm', 'e', 't', 'e', 'x', 't']
>>> ''.join(txt)
'sometext'
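If it helps, here is a minimal sketch that puts both points together: first a quick check that str.replace leaves the original string alone unless you rebind the name, then the same character loop wrapped in a function. The name strip_tags and the inside_tag flag are my own choices for illustration, not anything from a library. It assumes well-formed markup with no stray < or > inside attribute values, comments, or scripts.

```python
# str.replace returns a new string; the original is left unchanged.
s = '<b>hi</b>'
s.replace('<', '')       # result discarded, s is still '<b>hi</b>'
s = s.replace('<', '')   # rebinding the name keeps the result


def strip_tags(html):
    """Hypothetical helper: drop everything between '<' and '>'."""
    txt = []
    inside_tag = False  # True while we are between '<' and '>'
    for ch in html:
        if ch == '<':
            inside_tag = True
        elif ch == '>':
            inside_tag = False
        elif not inside_tag:
            txt.append(ch)  # keep only characters outside tags
    return ''.join(txt)


print(strip_tags('<body><div>some</div><div>text</div></body>'))
```

Building a list and joining it once at the end avoids creating a new string for every character, which is what repeated replace or += calls would do.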
Upvotes: 1