Reputation: 5
I'm trying to write the HTML returned by a Google search to a file in Python 3.4:
#coding=utf-8
try:
    from urllib.request import Request, urlopen # Python 3
except ImportError:
    from urllib2 import Request, urlopen # Python 2
useragent = 'Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0'
#Generate URL
url = 'https://www.google.com.tw/search?q='
query = str(input('Google It! :'))
full_url = url+query
#Request Data
data = Request(full_url)
data.add_header('User-Agent', useragent)
dataRequested = urlopen(data).read()
dataRequested = str(dataRequested.decode('utf-8'))
print(dataRequested)
#Write Data Into File
file = open('Google - '+query+'.html', 'w')
file.write(dataRequested)
It prints the string correctly, but when it writes to the file, it raises:
file.write(dataRequested)
UnicodeEncodeError: 'cp950' codec can't encode character '\u200e' in position 97658: illegal multibyte sequence
I tried changing how the data is decoded, but that doesn't work. I also tried replacing \u200e, but then other characters raise the same encoding error.
Upvotes: 0
Views: 355
Reputation: 823
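Look at this line first:
dataRequested = str(dataRequested.decode('utf-8'))
Is there a reason to wrap the decoded UTF-8 in str()? In Python 3, decode() already returns a string, so the wrapper is redundant. But that is not all. When you get data from the Internet it should be decoded, and when you save the string it should be encoded. People often do only one of the two, and that doesn't work. In your code the file is opened in text mode ('w') without an encoding, so Python encodes the text with the platform default codec, cp950 on your system, which cannot represent '\u200e'.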
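The decode-on-read, encode-on-write rule can be illustrated without any network access. This is only a minimal sketch: the sample string is made up, and cp950 is used because it is the codec named in the error message in the question.

```python
# Bytes as they would arrive from the network: UTF-8 encoded and containing
# U+200E, the character reported in the question's error message.
# The sample text itself is invented for illustration.
raw = u'Google\u200e'.encode('utf-8')

# Reading: bytes -> text
text = raw.decode('utf-8')

# A text-mode file with a cp950 default performs this encode step implicitly.
try:
    text.encode('cp950')
    cp950_ok = True
except UnicodeEncodeError:
    cp950_ok = False

# Saving: text -> bytes, using a codec that can represent every character
encoded = text.encode('utf-8')
```

Here cp950_ok ends up False, while the UTF-8 round trip returns the original bytes.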
I altered your code a bit. It works fine for me on both Python 2.7 and Python 3.4.
dataRequested = dataRequested.decode('utf-8')
print(dataRequested)
#Write Data Into File
file = open('Google - '+query+'.html', 'wb')
file.write(dataRequested.encode('utf-8'))
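An alternative to the binary-mode write above, offered as a sketch rather than as the code that was tested: keep the file in text mode and name the encoding explicitly. io.open accepts an encoding argument on both Python 2.7 and Python 3.4. The helper name save_html, the sample string and the temporary path are assumptions made for illustration only.

```python
import io
import os
import tempfile

def save_html(path, text):
    # Hypothetical helper: write text as UTF-8 regardless of the platform
    # default codec, and close the file when done.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)

# Stand-in for the decoded page; includes the character from the error.
sample = u'<html>Google\u200e</html>'
path = os.path.join(tempfile.mkdtemp(), 'Google - test.html')
save_html(path, sample)

# Read the file back with the same encoding to check the round trip.
with io.open(path, 'r', encoding='utf-8') as f:
    round_tripped = f.read()
```

The with block also closes the file, which the snippets above leave to the interpreter.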
Upvotes: 1