BobbyHo

Reputation: 5

Some Decoding Issue With String in Python

I'm trying to write the HTML source of a Google results page to a file in Python 3.4:

#coding=utf-8
try:
    from urllib.request import Request, urlopen  # Python 3
except ImportError:
    from urllib2 import Request, urlopen  # Python 2

useragent = 'Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0'

#Generate URL
url = 'https://www.google.com.tw/search?q='
query = str(input('Google It! :'))
full_url = url+query


#Request Data
data = Request(full_url)
data.add_header('User-Agent', useragent)
dataRequested = urlopen(data).read()
dataRequested = str(dataRequested.decode('utf-8'))


print(dataRequested)

#Write Data Into File
file = open('Google - '+query+'.html', 'w')
file.write(dataRequested)

It prints the string correctly, but when it writes to the file, it fails with:

file.write(dataRequested)
UnicodeEncodeError: 'cp950' codec can't encode character '\u200e' in position 97658: illegal multibyte sequence

I tried changing how the data is decoded, but that doesn't work. I also tried replacing \u200e, but then other characters raise the same encoding error.

Upvotes: 0

Views: 355

Answers (1)

Alex Ivanov

Reputation: 823

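One problem is

dataRequested = str(dataRequested.decode('utf-8'))

Is there a reason to wrap the decoded UTF-8 in str()? On Python 3, decode() already returns a string, so the call does nothing; on Python 2 it would try to re-encode the text as ASCII. But that is not all. When you get bytes from the Internet they should be decoded, but when you save the string it should be encoded. Some people miss this and only decode or only encode. It doesn't work that way. Here, open(..., 'w') encodes with your platform's default codec, cp950, which has no representation for '\u200e'.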
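To see where the error comes from, here is a minimal sketch that performs the encode step by hand. The sample string is made up for illustration; cp950 is the codec named in the traceback, and the assumption is that a text-mode file opened without an explicit encoding does the same conversion on write.

```python
# -*- coding: utf-8 -*-
# Made-up sample text containing U+200E (LEFT-TO-RIGHT MARK),
# the character reported in the traceback.
text = u'Google\u200e'

# Encoding to cp950 by hand, as a text-mode file on that platform would.
try:
    text.encode('cp950')
    cp950_ok = True
except UnicodeEncodeError:
    cp950_ok = False

# UTF-8 can represent every Unicode character, so this always succeeds.
utf8_bytes = text.encode('utf-8')

print(cp950_ok)
print(repr(utf8_bytes))
```

Replacing individual characters only moves the failure to the next character cp950 cannot represent, which matches what the question describes.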

I altered your code a bit. It works fine for me on both Python2.7 and Python3.4.

dataRequested = dataRequested.decode('utf-8')


print(dataRequested)

#Write Data Into File
#Binary mode plus an explicit encode; the with block closes the file
with open('Google - '+query+'.html', 'wb') as file:
    file.write(dataRequested.encode('utf-8'))
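A possible alternative, sketched here under assumptions rather than taken from the code above: keep the file in text mode but name the encoding explicitly, so the platform default is never used. io.open accepts an encoding argument on both Python 2.7 and Python 3. The helper name save_html, the sample text, and the output path are hypothetical.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def save_html(path, text):
    """Write text to path as UTF-8, regardless of the platform default codec."""
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)


# Hypothetical sample standing in for the decoded page content.
sample = u'<p>left-to-right mark: \u200e</p>'
out_path = os.path.join(tempfile.gettempdir(), 'google_sample.html')
save_html(out_path, sample)

# Read the raw bytes back to confirm what was stored on disk.
with io.open(out_path, 'rb') as f:
    raw = f.read()
```

Either approach stores the same bytes; the difference is only whether the encode step is written out by hand or delegated to the file object.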

Upvotes: 1
