adaniluk

Reputation: 431

Encoding in python

I have a problem comparing a string read from a file with a string I typed into my program. They should be equal, but even when I use decode('utf-8') the comparison says they are not. Here's the code:

final = open("info", 'r')
exported = open("final",'w')
lines = final.readlines()
for line in lines:
    if line == "Wykształcenie i praca": #error
        print "ok"

And this is how I save the file that I then try to read:

comm_p = bs4.BeautifulSoup(comm)
comm_f.write(comm_p.prettify().encode('utf-8'))

for string in comm_p.strings:
    # print repr(string).encode('utf-8')
    save = string.encode('utf-8')  # this is how I save
    info.write(save)
    info.write("\n")

info.close()

I also have # -- coding: utf-8 -- at the top of the file.

Any ideas?

Upvotes: 1

Views: 333

Answers (4)

Anuj

Reputation: 9632

Use unicode strings for the comparison:

>>> s = u'Wykształcenie i praca'
>>> s == u'Wykształcenie i praca'
True
>>>

When it comes to strings, unicode is the smartest move :)
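To make the point concrete, here is a minimal sketch of why decoding matters before comparing. It assumes the file content is UTF-8; the variable names are only for illustration and are not from the question.

```python
# -*- coding: utf-8 -*-
target = u"Wykształcenie i praca"

# Bytes as they would sit in a UTF-8 encoded file.
raw = target.encode("utf-8")

# Decoding turns the bytes back into a unicode string.
decoded = raw.decode("utf-8")

same_after_decode = (decoded == target)    # True: unicode compared with unicode
same_without_decode = (raw == target)      # False: bytes compared with unicode

print(same_after_decode)
print(same_without_decode)
```

The byte string is one element longer than the unicode string because "ł" takes two bytes in UTF-8, which is one way to tell which kind of string you are holding.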

Upvotes: 0

Burhan Khalid

Reputation: 174708

This should do what you need:

# -- coding: utf-8 --
import io

with io.open('info', encoding='utf-8') as final:
    lines = final.readlines()

for line in lines:
    if line.strip() == u"Wykształcenie i praca":  # strip the trailing newline before comparing
        print "ok"

You need to open the file with the right encoding, and since your string is not ASCII, you should mark it as unicode.
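If you want to try this without touching your real data, a self-contained sketch could look like the following. The temporary file and the find_matches helper are hypothetical stand-ins for your "info" file and loop; only io.open and strip() come from the answer above.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

TARGET = u"Wykształcenie i praca"


def find_matches(path, target=TARGET):
    """Return the 0-based indexes of lines equal to target, ignoring surrounding whitespace."""
    with io.open(path, encoding="utf-8") as handle:
        return [i for i, line in enumerate(handle) if line.strip() == target]


# Hypothetical sample data standing in for the asker's "info" file.
sample_path = os.path.join(tempfile.mkdtemp(), "info")
with io.open(sample_path, "w", encoding="utf-8") as handle:
    handle.write(u"Imię i nazwisko\n")
    handle.write(u"Wykształcenie i praca\n")

matches = find_matches(sample_path)
print(matches)
```

Writing the file through io.open with the same encoding also means you no longer need the manual string.encode('utf-8') calls when saving.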

Upvotes: 3

Ofir

Reputation: 8372

The difference is most likely a trailing '\n' character.

readlines() doesn't strip the trailing '\n' from each line; see "Best method for reading newline delimited files in Python and discarding the newlines?"
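A quick way to see the newline problem, sketched with an in-memory buffer instead of a real file (the sample text is made up):

```python
import io

# Hypothetical two-line input held in memory instead of a real file.
source = io.StringIO(u"first line\nsecond line\n")
lines = source.readlines()

keeps_newline = lines[0].endswith(u"\n")                       # True
matches_raw = (lines[0] == u"first line")                      # False, "\n" is still attached
matches_stripped = (lines[0].rstrip(u"\n") == u"first line")   # True

# splitlines() drops the line endings in one step.
clean = u"first line\nsecond line\n".splitlines()

print(keeps_newline)
print(matches_raw)
print(matches_stripped)
```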

In general, it is not a good idea to put a Unicode string literal in your code; it is better to read it from a resource file.

Upvotes: 0

Tim Pietzcker

Reputation: 336468

First, you need some basic knowledge about encodings. This is a good place to start. You don't have to read everything right now, but try to get as far as you can.

About your current problem:

You're reading a UTF-8 encoded file (probably), but you're reading it as an ASCII file. open() doesn't do any conversion for you.

So what you need to do (at least):

  • use codecs.open("info", "r", encoding="utf-8") to read the file
  • use Unicode strings for comparison: if line.rstrip() == u"Wykształcenie i praca":
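Put together, the two points above might look like this. It is only a sketch: the temporary file is a hypothetical stand-in for "info", and on Python 3 the built-in open() accepts encoding= directly, so codecs.open is mainly useful on Python 2.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

target = u"Wykształcenie i praca"

# Hypothetical stand-in for the "info" file, written as UTF-8.
path = os.path.join(tempfile.mkdtemp(), "info")
with codecs.open(path, "w", encoding="utf-8") as out:
    out.write(target + u"\n")

found = False
with codecs.open(path, "r", encoding="utf-8") as final:
    for line in final:
        # line is already a unicode string; rstrip() removes the trailing newline
        if line.rstrip() == target:
            found = True

print(found)
```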

Upvotes: 0
