H123

Reputation: 13

Python - scraping website with unicode

I am trying to scrape a site using this code

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import urllib2

    req = urllib2.Request('http://some website')
    # add_header takes the header name and its value as two separate arguments
    req.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')
    f = urllib2.urlopen(req)
    body = f.read()
    f.close()

This is part of the document returned by the read() method

    T\u00f3m l\u01b0\u1ee3c di\u1ec5n ti\u1ebfn Th\u01b0\u1ee3ng H\u1ed9i \u0110\u1ed3ng Gi\u00e1m M\u1ee5c v\u1ec1 Gia \u0110\u00ecnh\

How can I change the above code to get the result like this?

    Tóm lược diễn tiến Thượng Hội Đồng Giám Mục về Gia Đình

Thank you.

My issue was solved using mata's advice. Here is the code that works for me. Thank you everyone for helping, especially mata.

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import urllib2

    req = urllib2.Request('http://some website')
    req.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')
    f = urllib2.urlopen(req)
    # Turn the literal \uXXXX escapes into characters, then re-encode as UTF-8 bytes
    body = f.read().decode('unicode-escape').encode('utf-8')
    f.close()
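
To illustrate what the decode step does without a network call, here is a minimal sketch run on a short sample shaped like the fragment in the question. The names `raw`, `text_a` and `text_b` are illustrative, and the `json` variant assumes the fragment is the inside of a JSON string literal, which the question does not confirm.

```python
import json

# Sample bytes shaped like the fragment in the question:
# plain ASCII containing literal \uXXXX escape sequences.
raw = b"T\\u00f3m l\\u01b0\\u1ee3c di\\u1ec5n ti\\u1ebfn"

# Approach from the fix above: interpret the escapes directly.
text_a = raw.decode("unicode-escape")

# Alternative (assumption: the fragment comes from a JSON string):
# wrap it in quotes and let the JSON parser resolve the escapes.
text_b = json.loads('"' + raw.decode("ascii") + '"')

# Re-encode to UTF-8 only when bytes are needed, e.g. to write a file.
utf8_bytes = text_a.encode("utf-8")
```

One caveat: `unicode-escape` treats any non-ASCII bytes in the input as Latin-1, so it is only safe when the body is pure ASCII plus escapes, as in the sample. If the whole response is JSON, parsing it with `json.loads` is probably the more robust route.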

Upvotes: 1

Views: 285

Answers (2)

Mauro Baraldi

Reputation: 6550

You must detect the page's encoding. In most cases this information comes in the response headers, as the charset parameter of Content-Type.

    #!/usr/bin/python
    # -*- coding: utf-8 -*-

    import urllib2

    req = urllib2.Request("http://some website")
    req.add_header("User-agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36")
    f = urllib2.urlopen(req)
    encoding = f.headers.getparam('charset')  # charset parameter of the Content-Type response header
    body = f.read().decode(encoding)  # decode the raw bytes using that encoding
    f.close()

There are other ways to get the same result; I just adapted your approach.
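
One thing to watch for: when the server sends no charset, `getparam('charset')` returns None and `decode(None)` raises a TypeError. Below is a rough sketch of a guard, written against the standard library's `email.message.Message` so it runs without a network call. The helper name `charset_from_content_type` and the UTF-8 default are assumptions for illustration, not part of urllib2.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or a default.

    `default` is an assumption; pick whatever suits the site being scraped.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent
    return msg.get_content_charset() or default


# Header values as a server might send them
with_charset = charset_from_content_type("text/html; charset=ISO-8859-1")
without_charset = charset_from_content_type("application/json")

# Decoding sample bytes with the detected (or default) encoding
body = u"Gia \u0110\u00ecnh".encode("utf-8")
text = body.decode(charset_from_content_type("text/html; charset=UTF-8"))
```

With urllib2 the header value would come from `f.headers.get('Content-Type', '')`.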

Upvotes: 1

efirvida

Reputation: 4855

You need to detect the encoding of the page and then decode it. Try using this library for encoding detection: http://github.com/chardet/chardet. See the usage help and examples at http://chardet.readthedocs.org/en/latest/usage.html

    pip install chardet

Then use it:

    import urllib2
    import chardet  # <- import this lib

    req = urllib2.Request('http://some website')
    req.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')
    f = urllib2.urlopen(req)
    body = f.read()
    f.close()

    code = chardet.detect(body)           # <- detect the encoding
    body = body.decode(code['encoding'])  # <- decode
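
`chardet.detect` returns a dict with `encoding` and `confidence` keys, and `encoding` can be None when no guess is possible. A possible guard is sketched below. The helper `decode_detected`, the 0.5 threshold and the UTF-8 fallback are all assumptions for illustration, and a hand-written dict stands in for the real `chardet.detect(body)` call so the snippet runs without the library installed.

```python
def decode_detected(body, detection, fallback="utf-8", min_confidence=0.5):
    """Decode bytes using a chardet-style result dict, with a fallback.

    The threshold and fallback encoding are illustrative choices.
    """
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    if not encoding or confidence < min_confidence:
        encoding = fallback
    # "replace" substitutes undecodable bytes instead of raising
    return body.decode(encoding, "replace")


body = u"Gia \u0110\u00ecnh".encode("utf-8")

# Stand-in for chardet.detect(body) when detection fails
no_guess = {"encoding": None, "confidence": 0.0}
text = decode_detected(body, no_guess)
```

Note that a body like the fragment in the question is plain ASCII containing \uXXXX escapes, so charset detection alone would leave the escapes in place; the `unicode-escape` step from the question's update would still be needed.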

Upvotes: 1
