tw0fifths

Reputation: 37

Stripping chars in Python

I know this isn't very pretty code, and I'm sure there is an easier way, but I'm more concerned with why Python is not stripping the characters I asked it to.

import urllib, sgmllib


zip_code = raw_input('Give me a zip code> ')
url = 'http://www.uszip.com/zip/' + zip_code
print url

conn = urllib.urlopen('http://www.uszip.com/zip/' + zip_code)

i = 0
while i < 1000:
    for line in conn.fp:
        if i == 1:
            print line[7:-10]
            i += 1
        elif i == 344:
            line1 = line.strip()
            line2 = line1.strip('<td>')  # it's not stripping the characters
            print line2[17:-60]
            i += 1
        else:
            i += 1

Upvotes: 1

Views: 138

Answers (3)

Niklas B.

Reputation: 95358

The way you call it, strip removes any run of the characters <, >, t, and d, and only from the beginning and end of the string:

>>> '<p>some test</p>'.strip('<td>')
'p>some test</p'

If you want to remove every occurrence of the substring <td>, use replace:

>>> '<td>some test</td>'.replace('<td>', '')
'some test</td>'

Note that if you want to use that for some kind of input sanitization, it can be easily circumvented:

>>> '<td<td>>some test</td>'.replace('<td>', '')
'<td>some test</td>'

This is only one of the many ways people typically get burned when they write their own HTML parsing code, so you would probably be better off with an HTML parsing library like BeautifulSoup or an XML parser like lxml.
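As a rough sketch of the parser-based approach, the snippet below uses Python 3's standard-library html.parser as a stand-in for BeautifulSoup or lxml, so it runs without extra installs. The names TdTextExtractor and td_texts and the sample markup are made up for illustration; the real page's HTML isn't shown in the question, so treat this as a starting point only.

```python
from html.parser import HTMLParser


class TdTextExtractor(HTMLParser):
    """Collect the text content of every <td> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a <td>
        self.cells = []   # finished cell texts
        self._buf = []    # text pieces of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Cell closed: store its text without surrounding whitespace.
                self.cells.append("".join(self._buf).strip())
                self._buf = []

    def handle_data(self, data):
        if self.depth > 0:
            self._buf.append(data)


def td_texts(html):
    """Return the text of each <td> cell found in the given HTML string."""
    parser = TdTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.cells


# Made-up sample markup, not taken from the site in the question.
sample = "<table><tr><td>Population</td><td> 12,345 </td></tr></table>"
print(td_texts(sample))  # prints: ['Population', '12,345']
```

The parser tracks tags for you, so there is no need to count lines or slice at fixed offsets, both of which break as soon as the page layout changes.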

Upvotes: 4

Winston Ewert

Reputation: 45079

            line2 = line1.strip('<td>') #its not stripping the characters 

It doesn't strip the string <td>; rather, it strips the individual characters in that string. So it'll strip away any <, >, t, and d characters at the beginning and end of the string.
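To make the difference concrete, here is a small Python 3 sketch. The helper strip_substring is a hypothetical name, not a built-in; it removes one exact prefix and one exact suffix, in contrast to the character-set behaviour of strip.

```python
def strip_substring(text, prefix="", suffix=""):
    """Remove one exact prefix and/or suffix, leaving everything else alone."""
    if prefix and text.startswith(prefix):
        text = text[len(prefix):]
    if suffix and text.endswith(suffix):
        text = text[:-len(suffix)]
    return text


cell = "<td>data</td>"

# strip treats '<td>' as the set of characters {'<', 't', 'd', '>'},
# so it also eats the leading 'd' of 'data' and stops at '/' on the right.
print(cell.strip("<td>"))                      # prints: ata</

# Exact-substring removal keeps the cell text intact.
print(strip_substring(cell, "<td>", "</td>"))  # prints: data
```

On Python 3.9 and later, str.removeprefix and str.removesuffix do the same job without a helper.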

However, in general, that's a poor way to try to extract data from a web page. Look into BeautifulSoup for a better approach.

Upvotes: 3

macduff

Reputation: 4685

Parameters:

Here is the detail of parameters:

chars: characters to be removed from beginning or end of the string.

So it looks like strip only removes those characters at the beginning or end of the string. For anything else, I would recommend using a regex.
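A minimal sketch of the regex route in Python 3, assuming the goal is just to drop opening and closing td tags from a single line. The pattern, the name remove_td_tags, and the sample line are illustrative guesses rather than anything from the question, and a regex like this will not cope with arbitrary HTML.

```python
import re

# Matches an opening or closing td tag, with optional attributes.
# \b keeps it from matching longer tag names that merely start with "td".
TD_TAG = re.compile(r"</?td\b[^>]*>", re.IGNORECASE)


def remove_td_tags(line):
    """Delete every <td ...> and </td> tag from the line, wherever it occurs."""
    return TD_TAG.sub("", line)


# Made-up sample line, not taken from the site in the question.
print(remove_td_tags('  <td class="x">90210</td>  ').strip())  # prints: 90210
```

Unlike strip, re.sub removes matches anywhere in the string, not only at the ends.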

Upvotes: 0
