Reputation: 11776
I am crawling a website whose pages have URLs in Nepali, i.e. in a non-Latin script. How do I specify the start_urls for a spider (I am using Scrapy for this)? Is there an encoding technique for that? And does copy-pasting the URLs directly from the browser stand a chance of working?
Updated: I also need to follow the links that I find on certain pages, and of course those links are non-English as well. Thank you.
Upvotes: 0
Views: 674
Reputation: 1123850
URLs that conform to RFC 3986 contain only ASCII characters; anything else is encoded as UTF-8 bytes, which are then percent-encoded. Nepali uses the Devanagari script, which is fully representable in Unicode and thus can be encoded in UTF-8.
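To make "UTF-8 plus percent encoding" concrete, here is a minimal Python 3 sketch (the rest of this answer uses Python 2 module names). The variable names are only illustrative, and the hand-rolled version is an assumption-laden shortcut that works for this particular string because its only ASCII character is an underscore:

```python
from urllib.parse import quote, unquote

# A Devanagari page title, used purely as sample input
title = "मुख्य_पृष्ठ"

# Step 1: the text becomes UTF-8 bytes (three bytes per Devanagari code point)
utf8_bytes = title.encode("utf-8")

# Step 2: each non-ASCII byte is written as %XX; the ASCII "_" is left alone.
# This shortcut is NOT a general encoder, it only illustrates the idea.
by_hand = "".join(
    chr(b) if b < 0x80 else "%{:02X}".format(b) for b in utf8_bytes
)

# The standard library does both steps at once (UTF-8 is its default)
by_library = quote(title)

print(by_hand == by_library)          # both approaches agree on this input
print(unquote(by_library) == title)   # decoding recovers the original text
```

Decoding with unquote shows that the escapes carry exactly the original characters, so nothing is lost by using the encoded form.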
Take a look at the Nepali Wikipedia for examples. The following URL is a good example of UTF-8 and URL percent encoding:
http://ne.wikipedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0
The series of %E0%A4%AE escapes are percent-encoded UTF-8 bytes. The HTML source code of the page should have these URLs already encoded, but if they look like this instead:
http://ne.wikipedia.org/wiki/मुख्य_पृष्ठ
you can encode the path portion yourself with:
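# Python 2: urlparse and urllib are separate standard-library modules
import urlparse, urllib
# Split the URL so that only the path component gets quoted
parts = urlparse.urlsplit(u'http://ne.wikipedia.org/wiki/मुख्य_पृष्ठ')
# Encode the path to UTF-8 bytes, then percent-encode those bytes
parts = parts._replace(path=urllib.quote(parts.path.encode('utf8')))
encoded_url = parts.geturl().encode('ascii')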
Demo:
>>> import urlparse, urllib
>>> parts = urlparse.urlsplit(u'http://ne.wikipedia.org/wiki/मुख्य_पृष्ठ')
>>> parts = parts._replace(path=urllib.quote(parts.path.encode('utf8')))
>>> parts.geturl().encode('ascii')
'http://ne.wikipedia.org/wiki/%E0%A4%AE%E0%A5%81%E0%A4%96%E0%A5%8D%E0%A4%AF_%E0%A4%AA%E0%A5%83%E0%A4%B7%E0%A5%8D%E0%A4%A0'
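On Python 3 the same idea can be sketched with urllib.parse. The helper names encode_url and follow_link, the base URL, and the choice of "safe" characters below are my own assumptions rather than anything Scrapy or Wikipedia prescribes; the sketch also assumes an ASCII host name (non-ASCII hosts would need IDNA handling, which is not covered here):

```python
from urllib.parse import quote, urljoin, urlsplit

# Characters left untouched (an assumption for this sketch). "%" is included
# so that URLs which are already percent-encoded are not encoded twice.
SAFE_PATH = "/%:@!$&'()*+,;="
SAFE_QUERY = SAFE_PATH + "?"


def encode_url(url):
    """Percent-encode non-ASCII characters in the path and query as UTF-8."""
    parts = urlsplit(url)
    path = quote(parts.path, safe=SAFE_PATH)
    query = quote(parts.query, safe=SAFE_QUERY)
    return parts._replace(path=path, query=query).geturl()


def follow_link(page_url, href):
    """Resolve a (possibly relative) link found on page_url, then encode it."""
    return encode_url(urljoin(page_url, href))


# Hypothetical inputs, reusing the URL from the demo above
start_url = encode_url("http://ne.wikipedia.org/wiki/मुख्य_पृष्ठ")
next_url = follow_link(start_url, "/wiki/मुख्य_पृष्ठ")

print(start_url)
print(next_url == start_url)
```

A string like start_url is plain ASCII, so it should be usable as an entry in a spider's start_urls, and follow_link shows one way to treat the non-English links extracted from a page before requesting them.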
Upvotes: 1