Reputation: 111
I'm parsing an HTML table using BS4 in Python. Everything works fine: I'm able to identify all the elements that I need and print them. But the program stops working when I try to write the results to a text file. I get this error:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 13: ordinal not in range(128)"
I have tried using .encode('utf-8') in the write call, but then something like this gets written: 31.61 
Here's what I'm running. I used the same code structure to parse another table and it worked. I'd appreciate it if anyone could point me in the right direction.
from threading import Thread
import urllib2
import re
from bs4 import BeautifulSoup
url = "http://trackinfo.com/dog-racelines.jsp?page=1&runnername=Ww%20Gloriaestefan"
myfile = open('base/basei/' + url[57:].replace("%20", " ").replace("%27","'") + '.txt','w+')
soup = BeautifulSoup(urllib2.urlopen(url).read())
for tr in soup.find_all('tr')[0:]:
    tds = tr.find_all('td')
    if len(tds) >=0:
        print tds[0].text, ",", tds[4].text, ",", tds[7].text, ",", tds[12].text, ",", tds[14].text, ",", tds[17].text
        myfile.write(tds[0].text + ','+ tds[4].text + "," + tds[7].text + "," + tds[12].text + "," + tds[14].text + "," + tds[17].text)
myfile.close()
Upvotes: 0
Views: 1572
Reputation: 886
The code below works for me. I replaced the non-breaking space with a comma; this way you can use the output directly as a CSV (e.g. you can easily read it into Excel or LibreOffice Calc).
import urllib2
from bs4 import BeautifulSoup
url = "http://trackinfo.com/dog-racelines.jsp?page=1&runnername=Ww%20Gloriaestefan"
soup = BeautifulSoup(urllib2.urlopen(url).read())
with open('out.txt', 'w') as myfile:
    for tr in soup.find_all('tr'):
        tds = tr.find_all('td')
        # Skip header or short rows that lack the columns indexed below
        if len(tds) > 17:
            stripped_tds = [tds[x].text.strip() for x in (0, 4, 7, 12, 14, 17)]
            out = ','.join(stripped_tds)
            # Turn the non-breaking space (U+00A0) into a field separator
            out = out.replace(u'\xa0', ',')
            print out
            myfile.write(out + '\n')
(The with statement removes the need to explicitly call myfile.close(). It implicitly does this when the section of code inside the with is complete, even if it encounters an exception there.)
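To illustrate that last point, here is a minimal sketch of a with block that is interrupted by an exception. The scratch file path, the handle variable and the simulated ValueError are made up for the demonstration and are not part of the scraper itself.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

handle = None
try:
    with open(path, 'w') as myfile:
        handle = myfile  # keep a reference so the file object can be inspected later
        myfile.write('first line\n')
        raise ValueError('simulated failure while writing')
except ValueError:
    pass

# The with block closed the file on the way out, despite the exception,
# so the line written before the failure was flushed to disk.
file_was_closed = handle.closed
```

After this runs, file_was_closed should be True even though myfile.close() was never called explicitly.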
Content of out.txt:
2014-04-15,E5,31.28,7,6,32.18,C
2014-04-13,E6,31.07,2,4,31.64,B
2014-04-11,E6,31.21,6,6,32.53,B
2014-04-07,E7,30.93,5,7,32.31,B
2014-04-03,S1,30.82,3,2,31.23,
2014-03-30,E9,31.02,3,8,31.97,A
2014-03-28,E9,30.95,7,8,31.85,A
2014-03-23,E9,30.88,8,8,32.06,A
2014-03-21,E6,30.83,1,1,30.83,SB
2014-03-17,E5,31.14,1,1,31.14,C
2014-03-15,E5,31.00,4,4,31.62,C
2014-03-10,E3,31.46,4,1,31.46,D
2014-03-08,A3,31.79,4,5,32.23,D
2014-03-03,A6,31.20,3,5,31.81,D
2014-03-01,E3,31.61,3,3,31.88,D
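If you would rather not depend on the default 'ascii' codec at all, one possible variation (a sketch, not tested against the live page) is to keep the text as Unicode and open the file with an explicit encoding through io.open, which behaves the same on Python 2 and 3. The cells list below is invented sample data modelled on the first output row, and the temporary path is likewise just for illustration.

```python
import io
import os
import tempfile

# Sample cell texts, made up for illustration; the last one contains U+00A0.
cells = [u'2014-04-15', u'E5', u'31.28', u'7', u'6', u'32.18\xa0C']

# Keep everything as Unicode and replace the non-breaking space after joining.
line = u','.join(cell.strip() for cell in cells).replace(u'\xa0', u',')

# io.open accepts Unicode text and encodes it with the codec named here,
# so the implicit 'ascii' encoding step never happens.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')  # hypothetical location
with io.open(path, 'w', encoding='utf-8') as myfile:
    myfile.write(line + u'\n')

with io.open(path, 'r', encoding='utf-8') as myfile:
    written = myfile.read()

# For comparison: encoding the raw cell text to UTF-8 turns U+00A0 into the
# two bytes C2 A0, which is valid UTF-8 but looks odd in a Latin-1 viewer.
utf8_bytes = u'31.61\xa0'.encode('utf-8')
```

That last line is probably what happened with the .encode('utf-8') attempt in the question: the file was written correctly as UTF-8, but whatever displayed it assumed a single-byte encoding, so the non-breaking space showed up as extra characters.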
Upvotes: 1