Reputation: 13
I am trying to scrape a site using this code:
#!/usr/bin/python
# coding: utf-8
import urllib, urllib2
req = urllib2.Request('http://some website')
req.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')
f = urllib2.urlopen(req)
body = f.read()
f.close()
This is part of the document returned by the read() method:
T\u00f3m l\u01b0\u1ee3c di\u1ec5n ti\u1ebfn Th\u01b0\u1ee3ng H\u1ed9i \u0110\u1ed3ng Gi\u00e1m M\u1ee5c v\u1ec1 Gia \u0110\u00ecnh\
How can I change the above code to get the result like this?
Tóm lược diễn tiến Thượng Hội Đồng Giám Mục về Gia Đình
Thank you.
My issue was solved by following mata's advice. Here is the code that works for me. Thank you everyone for helping, especially mata.
#!/usr/bin/python
# coding: utf-8
import urllib, urllib2
req = urllib2.Request('http://some website')
req.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')
f = urllib2.urlopen(req)
body = f.read().decode('unicode-escape').encode('utf-8')
f.close()
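For reference, here is a small offline sketch of what the decode('unicode-escape') step does, using a hard-coded sample instead of a real response. The raw bytes are an assumption modelled on the output shown in the question. If the page is actually JSON, parsing it with the json module is probably the safer route, because unicode-escape treats any non-ASCII bytes as Latin-1.

```python
import json

# Sample bytes as they might come back from f.read(): plain ASCII
# containing literal \uXXXX escape sequences (assumed sample data).
raw = b"T\\u00f3m l\\u01b0\\u1ee3c"

# Option 1: the approach from the post, turning the escapes into characters.
text = raw.decode("unicode-escape")

# Option 2: if the escapes come from a JSON string, let the JSON parser
# handle them. The value is wrapped in quotes to form a JSON string literal.
text_json = json.loads('"' + raw.decode("ascii") + '"')

# Both options should produce the same Unicode text.
print(text == text_json)

# Encode to UTF-8 bytes when writing to a file or a byte stream.
utf8_bytes = text.encode("utf-8")
```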
Upvotes: 1
Views: 285
Reputation: 6550
You must detect the page's encoding. In most cases, this information comes in the Content-Type header of the response.
#!/usr/bin/python
# coding: utf-8
import urllib2
req = urllib2.Request("http://some website")
req.add_header("User-agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")
f = urllib2.urlopen(req)
# Read the charset from the Content-Type response header. getparam() returns
# None when it is missing, so fall back to UTF-8 (an assumption).
encoding = f.headers.getparam('charset') or 'utf-8'
body = f.read().decode(encoding) # Decode the raw bytes with that encoding.
f.close()
There are other ways to get the same result, but I just adapted your approach.
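To try the charset lookup without making a request, something like the sketch below may help. It parses a Content-Type value with the standard library's email.message.Message. The helper name charset_from_content_type and the UTF-8 default are my own assumptions, not part of urllib2.

```python
from email.message import Message


def charset_from_content_type(value, default="utf-8"):
    """Return the charset named in a Content-Type value, or `default`.

    Hypothetical helper; the UTF-8 default is an assumption.
    """
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


# Sample header values; a real one would come from the response headers.
with_charset = charset_from_content_type("text/html; charset=ISO-8859-1")
without_charset = charset_from_content_type("text/html")

# Decode sample bytes using the charset found in the header.
body = b"Gi\xe1m".decode(with_charset)
```

Note that the charset name comes back lowercased, which is fine for decode().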
Upvotes: 1
Reputation: 4855
You need to detect the encoding of the page and then decode it. Try using this library for encoding detection: http://github.com/chardet/chardet. See the usage help and examples at http://chardet.readthedocs.org/en/latest/usage.html
pip install chardet
Then use it:
import urllib, urllib2
import chardet #<- import this lib
req = urllib2.Request('http://some website')
req.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')
f = urllib2.urlopen(req)
body = f.read()
f.close()
code = chardet.detect(body) #<- detect the encoding
body = body.decode(code['encoding']) #<- decode
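One thing to watch: chardet.detect() returns a dict whose 'encoding' value can be None when it cannot make a guess, and decode(None) then fails. A possible guard is sketched below. The helper name decode_with_guess and the UTF-8 fallback are assumptions of mine, and the sample dicts only imitate the shape of chardet's result so the snippet runs without chardet installed.

```python
def decode_with_guess(body, guess, fallback="utf-8"):
    """Decode `body` using a chardet-style result dict.

    Hypothetical helper: `guess` is expected to look like the dict returned
    by chardet.detect(), e.g. {'encoding': 'utf-8', 'confidence': 0.9}.
    """
    # Use the fallback when no encoding was detected (assumed default).
    encoding = guess.get("encoding") or fallback
    # "replace" avoids an exception when the guess turns out to be wrong.
    return body.decode(encoding, "replace")


# Hand-written sample results in the same shape as chardet's output.
detected = decode_with_guess(b"T\xc3\xb3m", {"encoding": "utf-8", "confidence": 0.9})
undetected = decode_with_guess(b"abc", {"encoding": None, "confidence": 0.0})
```

With chardet installed, the call would be decode_with_guess(body, chardet.detect(body)).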
Upvotes: 1