Reputation: 37
I know this is not very pretty code, and I'm sure there is an easier way, but I'm more concerned about why Python is not stripping the characters I asked it to.
import urllib, sgmllib
zip_code = raw_input('Give me a zip code> ')
url = 'http://www.uszip.com/zip/' + zip_code
print url
conn = urllib.urlopen('http://www.uszip.com/zip/' + zip_code)
i = 0
while i < 1000:
    for line in conn.fp:
        if i == 1:
            print line[7:-10]
            i += 1
        elif i == 344:
            line1 = line.strip()
            line2 = line1.strip('<td>') #its not stripping the characters
            print line2[17:-60]
            i += 1
        else:
            i += 1
Upvotes: 1
Views: 138
Reputation: 95358
The way you call it, strip removes any occurrence of the characters <, >, t, and d, but only at the beginning or end of the string:
>>> '<p>some test</p>'.strip('<td>')
'p>some test</p'
If you want to remove every occurrence of the substring <td>, use replace:
>>> '<td>some test</td>'.replace('<td>', '')
'some test</td>'
Note that if you want to use that for some kind of input sanitization, it can be easily circumvented:
>>> '<td<td>>some test</td>'.replace('<td>', '')
'<td>some test</td>'
This is only one of many ways people typically get burned when they try to write their own HTML parsing code, so you may prefer to use an HTML parsing library like BeautifulSoup or an XML parser like lxml.
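As a rough illustration of the parser-based approach, here is a minimal sketch that uses the standard library's html.parser instead of BeautifulSoup or lxml, so that it runs without extra installs. It is written for Python 3 (the question's code is Python 2), and the names CellCollector, td_texts, and the sample markup are made up for this example; the real page layout is not known here.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text found inside each <td> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <td> elements we are currently inside
        self.cells = []   # accumulated text of each top-level cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.depth += 1
            if self.depth == 1:
                self.cells.append("")  # start collecting a new cell

    def handle_endtag(self, tag):
        if tag == "td" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text outside any <td> is ignored.
        if self.depth > 0:
            self.cells[-1] += data


def td_texts(html):
    """Return the whitespace-trimmed text of every <td> in html."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# Made-up sample markup, not taken from the real page.
sample = "<table><tr><td>some test</td><td> other cell </td></tr></table>"
print(td_texts(sample))  # ['some test', 'other cell']
```

The parser tracks tag boundaries itself, so no slicing or character stripping is needed and input like '<td<td>>' does not need special handling in your own code.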
Upvotes: 4
Reputation: 45079
line2 = line1.strip('<td>') #its not stripping the characters
It doesn't strip the string <td>; rather, it strips the characters in that string. So it'll strip away <, >, t, and d at the beginning and end of the string.
However, in general, that's a poor way to try to extract data from a web page. Look into BeautifulSoup for a better approach.
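To make the character-set behaviour concrete, the sketch below compares strip('<td>') with a small helper that removes an exact prefix and suffix. The helper name strip_exact and the sample line are assumptions for illustration only; the code is Python 3.

```python
def strip_exact(text, prefix, suffix):
    """Remove prefix and suffix once each, but only on an exact match."""
    if prefix and text.startswith(prefix):
        text = text[len(prefix):]
    if suffix and text.endswith(suffix):
        text = text[:-len(suffix)]
    return text


line = "<td>data table</td>"  # made-up sample line

# strip() treats '<td>' as the set {'<', 't', 'd', '>'} and keeps removing
# characters from both ends until it meets one outside that set, so it also
# eats the leading 'd' of "data" and leaves '</' behind.
print(line.strip("<td>"))                  # ata table</

# Exact prefix/suffix removal leaves the cell text intact.
print(strip_exact(line, "<td>", "</td>"))  # data table
```

This is still plain string handling, so it only works when the line starts and ends with exactly those tags.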
Upvotes: 3
Reputation: 4685
Parameters:
Here are the details of the parameter:
chars: the characters to be removed from the beginning or end of the string.
So strip only removes characters at the beginning or end of the string. To remove them anywhere else, I would recommend using a regex.
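A minimal sketch of the regex route, assuming the goal is to drop opening and closing td tags wherever they appear in a line (Python 3; the names TD_TAG and remove_td_tags are invented for this example):

```python
import re

# Matches <td>, </td>, and opening tags with attributes such as <td class="x">.
TD_TAG = re.compile(r"</?td\b[^>]*>", re.IGNORECASE)


def remove_td_tags(line):
    """Return line with every td tag removed, wherever it occurs."""
    return TD_TAG.sub("", line)


print(remove_td_tags("<td>some test</td>"))  # some test
print(remove_td_tags('<tr><td class="a">some test</td></tr>'))  # <tr>some test</tr>
```

Like replace, this is text substitution rather than real parsing, so it is best kept for simple, well-formed lines.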
Upvotes: 0