Scraping a webpage with URL in Nepali (Non-English)

Question

I am scraping a website whose page URLs are in Nepali, i.e. in a non-English script. How do I specify start_urls for a spider (I am using Scrapy)? Is there an encoding technique for this? And will copying and pasting the URLs directly from the browser work?

Update: I also need to follow the links I find on certain pages, and those links are in Nepali as well. Thank you.

Martijn Pieters · Accepted Answer

URLs that conform to RFC 3986 will be encoded using UTF-8 and URL Percent Encoding. Nepali uses the Devanagari script, which is perfectly representable in Unicode and thus can be encoded in UTF-8.

Take a look at the Nepali Wikipedia for examples. The URL of its main page is a good example of UTF-8 combined with URL percent encoding:

http://ne.wikipedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0

The series of %E0%A4%AE escapes are percent-encoded UTF-8 bytes. The HTML source code of the page should have these URLs already encoded, but if they look like this instead:

http://ne.wikipedia.org/wiki/मुख्य_पृष्ठ

you can encode the path portion yourself with:

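# Python 2: urlparse and urllib.quote live in urllib.parse in Python 3
import urlparse, urllib

# Split the URL, percent-encode the UTF-8 bytes of the path, then reassemble
parts = urlparse.urlsplit(u'http://ne.wikipedia.org/wiki/मुख्य_पृष्ठ')
parts = parts._replace(path=urllib.quote(parts.path.encode('utf8')))
encoded_url = parts.geturl().encode('ascii')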

Demo:

>>> import urlparse, urllib
>>> parts = urlparse.urlsplit(u'http://ne.wikipedia.org/wiki/मुख्य_पृष्ठ')
>>> parts = parts._replace(path=urllib.quote(parts.path.encode('utf8')))
>>> parts.geturl().encode('ascii')
'http://ne.wikipedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0'
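
The code above is Python 2. On Python 3 the same idea could be sketched roughly as follows. The helper names encode_url and resolve_link are made up for this illustration and are not part of Scrapy or the standard library. Keeping '%' in the safe set is also an assumption of this sketch: it avoids re-encoding escapes that are already present, but it will not escape a literal percent sign. Only the path is handled here, as in the answer; query strings and non-ASCII host names are left untouched.

```python
from urllib.parse import quote, urljoin, urlsplit


def encode_url(url):
    """Percent-encode the path of `url` (hypothetical helper).

    quote() encodes str input as UTF-8 by default. '/' and '%' are
    treated as safe so that path separators and escapes that are
    already present are left alone (an assumption of this sketch).
    """
    parts = urlsplit(url)
    encoded_path = quote(parts.path, safe="/%")
    return parts._replace(path=encoded_path).geturl()


def resolve_link(page_url, href):
    """Resolve a link found on a page against that page's URL,
    then percent-encode the result (hypothetical helper)."""
    return encode_url(urljoin(page_url, href))


start_url = encode_url("http://ne.wikipedia.org/wiki/मुख्य_पृष्ठ")
print(start_url)
```

The resulting string is plain ASCII, so it can be listed in start_urls, and resolve_link shows one way the Nepali links extracted from a page might be normalised before they are requested.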
