Tianhe Xie

Reputation: 261

Mimic web URL encoding for Chinese characters in Python

I want to mimic URL encoding for Chinese characters. For my use case, I have a search URL for an e-commerce site:

'https://search.jd.com/Search?keyword={}'.format('ipad')

When I search for a product in English, this works fine. However, I need to search with Chinese input, so I tried

'https://search.jd.com/Search?keyword={}'.format('耐克t恤')

and found the following encoding under the network tab:

https://list.tmall.com/search_product.htm?q=%C4%CD%BF%CBt%D0%F4

So basically, I need to encode inputs like '耐克t恤' into '%C4%CD%BF%CBt%D0%F4'. Which encoding is the website using, and how can I convert Chinese characters to this form with Python?

Update: I checked headers and it seems like content encoding is gzip?

Upvotes: 1

Views: 951

Answers (2)

Vignesh Bayari R.

Reputation: 663

Try the urllib.parse module, specifically the urllib.parse.urlencode() function. You can pass it a dict containing the query parameters and the encoding (in this case it appears to be 'gb2312') to get a valid URL suffix that you can use directly.

In this case, your code will look something like:

import urllib.parse

keyword = '耐克t恤'
url = 'https://search.jd.com/Search?{url_suffix}'.format(url_suffix=urllib.parse.urlencode({'keyword': keyword}, encoding='gb2312'))

More information is available in the Python documentation on standard encodings and on urllib.parse.urlencode().
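As a quick check of the 'gb2312' guess, here is a minimal sketch that uses only the standard library. The variable names are illustrative, and the percent-encoded string is the one quoted in the question. If the guess is right, the encoded output should match that string and decoding it should return the original keyword.

```python
from urllib.parse import quote, unquote, urlencode

keyword = '耐克t恤'

# Percent-encode the GB2312 bytes; unreserved ASCII such as 't' is left as-is.
encoded = quote(keyword, encoding='gb2312')
print(encoded)

# urlencode builds the whole "key=value" query string with the same encoding.
query = urlencode({'keyword': keyword}, encoding='gb2312')
print(query)

# Decoding the string captured in the network tab should give back the input.
decoded = unquote('%C4%CD%BF%CBt%D0%F4', encoding='gb2312')
print(decoded)
```

If a site rejects characters outside GB2312, trying 'gbk' or 'gb18030' as the encoding may be worth a test, since both are supersets of GB2312.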

Upvotes: 6

tjallo

Reputation: 791

The encoding used seems to be GB2312.

This could help you:

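def encodeGB2312(data):
    # Encode the text as GB2312 bytes and render them as uppercase hex digits
    hexData = data.encode(encoding='GB2312').hex().upper()
    # Prefix every byte (two hex digits) with '%', including plain ASCII bytes
    encoded = '%' + '%'.join(hexData[i:i + 2] for i in range(0, len(hexData), 2))
    return encoded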

output = encodeGB2312('耐克t恤')

print(output)
url = f'https://list.tmall.com/search_product.htm?q={output}'
print(url)

Output:

%C4%CD%BF%CB%74%D0%F4
https://list.tmall.com/search_product.htm?q=%C4%CD%BF%CB%74%D0%F4

The only problem with my code is that its output doesn't correspond 100% with the link you are trying to produce: it percent-encodes the t character as %74, whereas your link keeps the t unencoded. It still seems to work when opening the URL, though.
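A small sketch of why both forms are likely to be accepted: percent-decoding either string yields the same text, because %74 is just the byte for ASCII 't'. The variable names are illustrative only.

```python
from urllib.parse import unquote

with_literal_t = '%C4%CD%BF%CBt%D0%F4'    # form seen in the browser
with_encoded_t = '%C4%CD%BF%CB%74%D0%F4'  # form produced by encodeGB2312

# Both strings decode to the same keyword under GB2312.
a = unquote(with_literal_t, encoding='gb2312')
b = unquote(with_encoded_t, encoding='gb2312')
print(a == b)
```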

Edit:

Vignesh Bayari R.'s post handles the URL in the correct (intended) way, but in this case my solution works too.

Upvotes: 2
