Reputation: 29
Here is my full code. It works fine with ASCII, but as soon as "unicode" characters come into the picture... I hate my life...
I know the code is not in English, so let me explain:
I have got 2 input files (realmek, nevek), and 1 result file (osszes).
I also have a working page (in HTML).
BUT when I try to use unusual characters such as "űáéđĐ", I need to save the 2 input files and the 1 output file as Unicode. Then my program throws an "encoding/decoding" error, and I know that is expected.
So my question is: how can I solve this? Where do I need to handle decoding and encoding?
I have been thinking about this for 3 days... I tried many ways of decoding, like "u = unicode( s, "utf-8" )" ; $ export LANG=en_US.UTF-8; etc. But none of them worked.
from urllib import urlopen
import re

faj = "hiba"
cast = "hiba"
pont = 0
szint = 0

fj = open(r"C:\Users\Rendszergazda\Desktop\Achievements\Realmek.txt", "r")
tombr = fj.readline()
realmek = tombr.split(" ")
fj.close()

fh = open(r"C:\Users\Rendszergazda\Desktop\Achievements\Nevek.txt", "r")
tomb = fh.readline()
nevek = tomb.split(" ")
fh.close()

osszes = open(r"C:\Users\Rendszergazda\Desktop\Achievements\Osszes.txt", "a")

for x in realmek:
    realm = x
    for y in nevek:
        nev = y
        lap = urlopen("http://eu.battle.net/wow/en/character/"+str(realm)+"/"+str(nev)+"/achievement").read()
        letezik = re.compile('<div id="server-erro(.*)">')
        letez = re.findall(letezik,lap)
        if (letez != []):
            a = 0
        else:
            lapn = lap.split("\n")
            mapo = lapn[1087]
            pontos = re.compile('\t\t\t\t\t(.*)\r')
            pont = re.findall(pontos,mapo)
            mapom = lapn[1322]
            feastn = re.compile('<div class="bar-contents">\t\t\t\t\t\t\t\t\t\t\t\t(.*)\r')
            feast = re.findall(feastn,mapom)
            fajkeres = re.compile('</strong></span> <a href="/wow/en/game/race/(.*)" class="race">')
            castkeres = re.compile('</a> <a href="/wow/en/game/class/(.*)" class="class">')
            szintkeres = re.compile('<span class="level"><strong>(.*)</strong></span> <a href="/wow/en/game/')
            faj = re.findall(fajkeres,lap)
            cast = re.findall(castkeres,lap)
            szint = re.findall(szintkeres,lap)
            link = "http://eu.battle.net/wow/en/character/"+str(realm)+"/"+str(nev)+"/advanced"
            ccast = cast[0]
            ffaj = faj[0]
            sszint = szint[0]
            ppont = pont[0]
            ffeast = feast[0]
            osszes.write(str(nev)+" "+str(realm)+" "+str(ppont)+" "+str(ffeast)+" "+str(ffaj)+" "+str(ccast)+" "+str(sszint)+" "+str(link)+"\n")

osszes.close()
Upvotes: 0
Views: 449
Reputation: 50220
Instead of plain open, use codecs.open to read and write your files. It takes an optional argument that specifies which encoding to use. First make sure that you can read, print and write non-ASCII text correctly (it will be seen as unicode inside your script), and afterwards check whether you're using any regexps that need adjustment.
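A minimal sketch of that file handling, assuming the files are saved as UTF-8 (if they were saved with another encoding, the encoding name has to change). The temporary directory and the sample names are made up so the example can run on its own; only the file names Nevek.txt and Osszes.txt come from the question.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Hypothetical stand-ins for the real files, placed in a temp directory.
workdir = tempfile.mkdtemp()
names_path = os.path.join(workdir, "Nevek.txt")
result_path = os.path.join(workdir, "Osszes.txt")

# Create sample input, encoded as UTF-8 (assumed encoding).
with codecs.open(names_path, "w", encoding="utf-8") as f:
    f.write(u"Árvíztűrő Đuro\n")

# Reading: codecs.open decodes the bytes, so readline() returns unicode text.
with codecs.open(names_path, "r", encoding="utf-8") as fh:
    nevek = fh.readline().split()

# Writing: unicode text is encoded back to UTF-8 on the way out.
# Concatenate unicode directly; in Python 2, str() on non-ASCII unicode
# text raises UnicodeEncodeError.
with codecs.open(result_path, "a", encoding="utf-8") as osszes:
    for nev in nevek:
        osszes.write(nev + u" " + u"ok" + u"\n")

# Read the result back to confirm the round trip.
with codecs.open(result_path, "r", encoding="utf-8") as check:
    written = check.read()
```

The same pattern should also work with io.open, which accepts the same encoding argument in Python 2.6 and later.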
Also, if you're using any non-ASCII characters in your Python source, declare your script's encoding by adding something like this as the first or second line:
# -*- coding: utf-8 -*-
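As a rough illustration of both points (the declaration and the regexp check), the sketch below puts non-ASCII literals in the source and matches them with a unicode pattern. The `<span class="name">` markup and the sample name are hypothetical, and the page is assumed to be UTF-8; the bytes are built by hand here rather than downloaded.

```python
# -*- coding: utf-8 -*-
import re

# Simulated response body; a real urlopen(...).read() also returns bytes.
raw = u'<span class="name">Árvíztűrő</span>'.encode("utf-8")

# Decode once, right after reading (assumption: the page is UTF-8).
page = raw.decode("utf-8")

# Unicode pattern; re.UNICODE makes \w match accented letters in Python 2.
name_pattern = re.compile(u'<span class="name">(\\w+)</span>', re.UNICODE)
found = name_pattern.findall(page)
```

Without the decode step and the re.UNICODE flag, Python 2 would treat the page as bytes and `\w` would stop at the first accented letter.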
Upvotes: 2