Reputation: 71
I need help with a scraper I'm writing. I'm trying to scrape a table of university rankings, and some of those schools are European universities with foreign characters in their names (e.g. ä, ü). I'm already scraping another table on another site with foreign universities in the exact same way, and everything works fine. But for some reason, the current scraper won't work with foreign characters (and as far as parsing foreign characters goes, the two scrapers are exactly the same).
Here's what I'm doing to try & make things work:
Declare encoding on the very first line of the file:
# -*- coding: utf-8 -*-
Importing & using smart_unicode from the Django framework:
from django.utils.encoding import smart_unicode

school_name = smart_unicode(html_elements[2].text_content(), encoding='utf-8',
                            strings_only=False, errors='strict').encode('utf-8')
Using the encode function, as seen above, chained onto the smart_unicode call. I can't think of what else I could be doing wrong. Before dealing with these scrapers, I really didn't understand much about different encodings, so it's been a bit of an eye-opening experience. I've tried reading up on the subject, but still can't overcome this problem.
I understand that in an encoding, every character is assigned a number, which can be expressed in hex, binary, etc. Different encodings have different capacities for how many languages they support (e.g. ASCII only supports English, while UTF-8 seems to support everything). However, I feel like I'm doing everything necessary to ensure the characters are printed correctly. I don't know where my mistake is, and it's driving me crazy. Please help!!
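To check my understanding of the basics, here's a quick interpreter session I ran (just an illustration with a single ä, not from the scraper itself) showing what I think can go wrong when the bytes and the assumed encoding don't match:

>>> u'ä'
u'\xe4'
>>> u'ä'.encode('utf-8')
'\xc3\xa4'
>>> u'ä'.encode('latin-1')
'\xe4'
>>> '\xe4'.decode('utf-8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 0: unexpected end of data

So if the page weren't really UTF-8, a strict UTF-8 decode like mine would blow up or mangle the characters -- but I don't see why only one of my two scrapers would be affected.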
Upvotes: 7
Views: 2251
Reputation: 6213
You first need to look at the <head> part of the document and see whether there's charset information:
<meta http-equiv="Content-Type" content="text/html; charset=xxxxx">
(Note that StackOverflow, this very page, doesn't have any charset info... I wonder how 中文字, which I typed assuming it's UTF-8 in here, will display on Chinese PeeCees that are most probably set up as GBK, or on Japanese pasokon, which are still firmly in Shift-JIS land.)
So if you have a charset, you know what to expect, and can deal with it accordingly. If not, you'll have to do some educated guessing -- are there non-ASCII chars (>127) in the plain text version of the page? Are there HTML entities like &#19968; (一) or &eacute; (é)?
Once you have guessed/ascertained the encoding of the page, you can convert that to UTF-8, and be on your way.
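A rough sketch of that workflow in Python 2 (the regex, the fallback order, and the URL are purely illustrative -- a real HTML parser or the chardet library would be more robust):

import re
import urllib2

def guess_charset(raw_html):
    # 1. Trust an explicit charset declaration if one is present.
    m = re.search(r'charset=["\']?([\w-]+)', raw_html)
    if m:
        return m.group(1)
    # 2. Otherwise guess: if the bytes decode cleanly as UTF-8, assume UTF-8.
    try:
        raw_html.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        # 3. Crude fallback for Western European pages.
        return 'iso-8859-1'

raw = urllib2.urlopen('http://example.com/rankings.html').read()
page = raw.decode(guess_charset(raw))  # a unicode object from here on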
Upvotes: -1
Reputation: 9509
If you are using the requests library, it will automatically decode the content based on the HTTP headers. Getting the HTML content of a page is really easy:
>>> import requests
>>> r = requests.get('https://github.com/timeline.json')
>>> r.text
'[{"repository":{"open_issues":0,"url":"https://github.com/...
Upvotes: 1
Reputation: 201818
When extracting information from a web page, you need to determine its character encoding, similarly to how browsers do such things (analyzing HTTP headers, parsing HTML to find meta tags, and possibly guesswork based on the actual data, e.g. the presence of something that looks like a BOM in some encoding). Hopefully you can find a library routine that does this for you.
In any case, you should not expect all web sites to be UTF-8 encoded. ISO-8859-1 is still in widespread use, and in general reading ISO-8859-1 as if it were UTF-8 results in a big mess (for any non-ASCII characters).
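One such library routine is BeautifulSoup's UnicodeDammit, which tries declared encodings, BOM sniffing, and (if installed) chardet in turn -- a minimal sketch, with a placeholder URL:

import urllib2
from bs4 import UnicodeDammit

raw = urllib2.urlopen('http://example.com/rankings.html').read()
dammit = UnicodeDammit(raw)
print dammit.original_encoding  # e.g. 'utf-8' or 'iso-8859-1'
text = dammit.unicode_markup    # the whole document as unicode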
Upvotes: 2