smatthewenglish

Reputation: 2889

Determine the encoding of content scraped via XPath and convert it to Unicode

I used the Firefox XPath extractor to extract the following snippet from this website: http://www.zdic.net/z/19/js/5DCD.htm

The part I'm looking for is 丨フ丨ノ一丨ノ丶フノ一ノ丨フ一一ノフフ丶

The XPath extractor add-on gave me the following expression: id('z_i_t2_bis')

I entered that in the Scrapy shell with this command: response.selector.xpath("id('z_i_t2_bis')").extract()

It returned this:

[u'<span id="z_i_t2_bis" title="\u7ad6\u6298\u7ad6\u6487\u6a2a\u7ad6\u6487\u637a\u6298\u6487\u6a2a\u6487\u7ad6\u6298\u6a2a\u6a2a\u6487\u6298\u6298\u637a">\u4e28\u30d5\u4e28\u30ce\u4e00\u4e28\u30ce\u4e36\u30d5\u30ce\u4e00\u30ce\u4e28\u30d5\u4e00\u4e00\u30ce\u30d5\u30d5\u4e36</span>']

How can I tell if that's what I want?

It seems to be encoded for HTML. Is there a way to convert it back to Unicode?

Upvotes: 0

Views: 96

Answers (1)

mkiever

Reputation: 891

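It's already Unicode. What you see is just the escaped representation that Python shows when it displays a list of unicode strings; the \uXXXX sequences stand for the actual characters. So you can check directly for your pattern with the 'in' operator: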

# The text you expect to find, written as literal characters
pattern = u'丨フ丨ノ一丨ノ丶フノ一ノ丨フ一一ノフフ丶'
# The list returned by extract(), exactly as shown in the question
result = [u'<span id="z_i_t2_bis" title="\u7ad6\u6298\u7ad6\u6487\u6a2a\u7ad6\u6487\u637a\u6298\u6487\u6a2a\u6487\u7ad6\u6298\u6a2a\u6a2a\u6487\u6298\u6298\u637a">\u4e28\u30d5\u4e28\u30ce\u4e00\u4e28\u30ce\u4e36\u30d5\u30ce\u4e00\u30ce\u4e28\u30d5\u4e00\u4e00\u30ce\u30d5\u30d5\u4e36</span>']

# result[0] is the first (and only) extracted string
if pattern in result[0]:
    print('found')
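
If you only want the stroke sequence itself, you can strip the markup and print the string instead of looking at the list. The following is a minimal sketch using only the standard library, with the string copied from the question; the regex is a stand-in that is adequate for this one flat span, not a general HTML parser. In the Scrapy shell, appending /text() to the XPath (id('z_i_t2_bis')/text()) should give you the inner text directly.

```python
import re

# First element of the list returned by extract() in the question.
html = u'<span id="z_i_t2_bis" title="\u7ad6\u6298\u7ad6\u6487\u6a2a\u7ad6\u6487\u637a\u6298\u6487\u6a2a\u6487\u7ad6\u6298\u6a2a\u6a2a\u6487\u6298\u6298\u637a">\u4e28\u30d5\u4e28\u30ce\u4e00\u4e28\u30ce\u4e36\u30d5\u30ce\u4e00\u30ce\u4e28\u30d5\u4e00\u4e00\u30ce\u30d5\u30d5\u4e36</span>'

# Remove the opening and closing tags, leaving only the inner text.
text = re.sub(r'<[^>]+>', '', html)

# repr() is what you see when a string sits inside a printed list;
# on Python 2 it uses \uXXXX escapes for non-ASCII characters.
print(repr(text))

# Printing the string itself shows the characters
# (assuming the terminal can display them).
print(text)

# Number of characters (strokes) in the extracted text.
print(len(text))
```

The point is that the escapes and the characters are the same data; only the way the string is displayed differs.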

Upvotes: 1
