McLinux

Reputation: 263

HTML Decoding in Python

I am writing a Python script for mass replacement of links (actually image and script sources) in HTML files, and I am using lxml. There is one problem: the HTML files are quizzes, and their data is packaged like this (there is also some Cyrillic in it):

<input class="question_data" value="{&quot;text&quot;:&quot;&lt;p&gt;[1] је наука која се бави чувањем, обрадом и преносом информација помоћу рачунара.&lt;/p&gt;&quot;,&quot;fields&quot;:[{&quot;id&quot;:&quot;1&quot;,&quot;type&quot;:&quot;fill&quot;,&quot;element&quot;:{&quot;sirina&quot;:&quot;103&quot;,&quot;maxDuzina&quot;:&quot;12&quot;,&quot;odgovor&quot;:[&quot;Информатика&quot;]}}]}" name="question:1:data" id="id3a1"/>

When I try to print this data in Python using:

print "OLD_DATA:", data

It fails with the error "UnicodeEncodeError: character maps to undefined". There are more elements like this one. My goal is to change the image links inside the value attribute of the input, but I can't change them until I know how to print this data (or how it should be written back to the file). How does Python handle (interpret) this? Please help. Thanks!

Upvotes: 0

Views: 114

Answers (1)

Ketzak

Reputation: 628

You're running into a problem I've hit many times. That error almost always means that the console you're using can't display the characters you're trying to print: Python has to encode the Unicode text into the console's encoding, and that encoding has no mapping for the Cyrillic letters. Try writing the output to a file instead, then open the file in an editor that can display the characters.
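As a rough sketch of the log-to-file idea, assuming the data is already a Unicode string and that UTF-8 is an acceptable encoding for the log (the function name `log_unicode` and the default file name are made up for illustration):

```python
import io

def log_unicode(text, path="debug_log.txt"):
    """Append a Unicode string to a UTF-8 encoded file instead of printing it."""
    # io.open takes an explicit encoding on both Python 2 and Python 3,
    # so the console's encoding is never involved.
    with io.open(path, "a", encoding="utf-8") as handle:
        handle.write(text + u"\n")
```

Calling `log_unicode(data)` in place of the `print` statement should then leave the Cyrillic text readable in any editor that opens the file as UTF-8.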

If you really want to see it on your console, you could write a function that screens the strings you print and replaces or escapes any characters the console can't display.
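One possible shape for such a screening function is sketched below. It assumes the input is a Unicode string and falls back to ASCII when the console encoding is unknown; the name `console_safe` is hypothetical. Characters the target encoding cannot represent are turned into backslash escapes rather than raising an error.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the console can't encode shown as escapes."""
    # Assumption: use the console's reported encoding, or ASCII if none is known.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # "backslashreplace" swaps unencodable characters for \uXXXX-style escapes.
    return text.encode(encoding, "backslashreplace").decode(encoding)
```

With this, `print(console_safe(data))` should not raise UnicodeEncodeError, although the Cyrillic letters will show up as escape sequences on a console that cannot display them.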

I also found a couple of other Stack Overflow posts that might help:

How do I get Cyrillic in the output, Python?

What is right way to use cyrillic in python lxml library

I would also recommend this Python manual entry and this article:

https://docs.python.org/2/howto/unicode.html

http://www.joelonsoftware.com/articles/Unicode.html
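As for changing the links themselves, here is a minimal sketch rather than a tested solution. It assumes the attribute value has already been read through lxml (which returns attribute values with the `&quot;`-style entities decoded), that the value is JSON as in your sample, and that the links live in the HTML under the "text" key. The function name and both prefixes are placeholders.

```python
import json

def rewrite_links(value, old_prefix, new_prefix):
    """Replace a link prefix inside the JSON stored in an input's value attribute."""
    data = json.loads(value)  # strings come back as Unicode text
    # Assumption: the image links appear in the HTML held under the "text" key.
    data["text"] = data["text"].replace(old_prefix, new_prefix)
    # ensure_ascii=False keeps the Cyrillic readable instead of \uXXXX escapes.
    return json.dumps(data, ensure_ascii=False)
```

With lxml this would be used roughly as `element.set("value", rewrite_links(element.get("value"), old, new))`, and lxml should escape the value again when the document is serialized. None of this requires printing the data to the console.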

Upvotes: 1
