How can I use non-ASCII characters?

Question

I am using Scrapy and XPath to parse a Russian-language website.

In this topic, alecxe suggested how to construct the XPath expression to get the values. However, I don't understand how to handle the case where Param1_name is in Russian.

Here is the xpath expression:

//*[text()="Param1_name_in_russian"]/following-sibling::text()

HTML snippet:


            
                
                      
                         Param1_name_in_russian" Param1_value"
                      
                         Param2_name_in_russian" Param2_value
                      
                         Param3_name_in_russian" Param3_value"
                
              
            
                
                    
                       Param4_name_in_russianParam4_value
                
                   Param5_name
                      Param5_value

EDITED based on comments

I assume I didn't state the question properly, since none of the suggested solutions worked for me: when I tested the suggested XPath expressions in the Scrapy console, the output was empty. So here is more detailed information about the website I need to parse:

link to the web-site: link to real-estate web site
screenshot of what I need to parse:

screen_shot

WGS · Accepted Answer

Consider declaring your encoding at the beginning of the file as latin-1. See the documentation for a thorough explanation as to why.

I'll be using lxml instead of Scrapy below, but the logic is the same.

Code:

#!/usr/bin/env python
# -*- coding: latin-1 -*-

from lxml import html

markup = """<div class="obj-params">
            
                
                      
                         Некий текст" Param1_value"
                      
                         Param2_name_in_russian" Param2_value
                      
                         Param3_name_in_russian" Param3_value"
                
              
            
                
                    
                       Param4_name_in_russianParam4_value
                
                   Param5_name
                      Param5_value"""

tree = html.fromstring(markup)
pone_val = tree.xpath(u"//*[text()='Некий текст']/following-sibling::text()")

print pone_val

Result:

['" Param1_value"']
[Finished in 0.5s]

Note that because the expression contains non-ASCII characters, it has to be a unicode string, so the u prefix on the XPath string is necessary, as @warwaruk pointed out in a comment on your question.
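
If lxml is not at hand, the same idea can be sketched with the standard library alone. This is only a sketch under assumptions, not the code above: it swaps lxml for xml.etree.ElementTree, which has no text() node test or following-sibling axis, so the text after the matching element is read from its .tail instead. The markup (a <p> holding a <b> label followed by the value) and the helper name value_after_label are made up for illustration, so the real page's tags may differ.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# Hypothetical markup: tag names are assumed, not taken from the real page.
MARKUP = (
    u'<div class="obj-params">'
    u'<p><b>Некий текст</b> Param1_value</p>'
    u'<p><b>Другой текст</b> Param2_value</p>'
    u'</div>'
)


def value_after_label(root, label):
    """Return the text directly following the element whose text equals label."""
    for element in root.iter():
        if element.text == label:
            # .tail holds the text between this element's end tag and the next tag
            return (element.tail or u'').strip()
    return None


# Parse from UTF-8 bytes; the label passed in is a unicode string (note the u prefix)
root = ET.fromstring(MARKUP.encode('utf-8'))
print(value_after_label(root, u'Некий текст'))
```

The point is the same as with lxml: the label being compared has to be a unicode string, otherwise it will not equal the decoded text of the parsed element.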

Let us know if this helps.

EDIT:

Based on the site's markup, there's actually a better way to get the values. Again, using lxml and not Scrapy since the difference between the two here is just .extract() anyway. Basically, check my XPath for the name, room, square, and floor.

import requests as rq
from lxml import html

url = "http://www.lun.ua/%D0%BF%D1%80%D0%BE%D0%B4%D0%B0%D0%B6%D0%B0-%D0%BA%D0%B2%D0%B0%D1%80%D1%82%D0%B8%D1%80-%D0%BA%D0%B8%D0%B5%D0%B2"
r = rq.get(url)
tree = html.fromstring(r.text)

divs = tree.xpath("//div[@class='obj-left']")

for div in divs:

    name = div.xpath("./h3/span/a/text()")[0]
    details = div.xpath(".//div[@class='obj-params-col'][1]")[0]
    room = details.xpath("./p[1]/text()[last()]")[0]
    square = details.xpath("./p[2]/text()[last()]")[0]
    floor = details.xpath("./p[3]/text()[last()]")[0]

    print name.encode("utf-8")
    print room.encode("utf-8")
    print square.encode("utf-8")
    print floor.encode("utf-8")

This doesn't print everything cleanly on my end (I get some [Decode error - output not utf-8] messages). However, encoding aside, I believe this approach is much better scraping practice overall.
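
To show why selecting by position sidesteps the non-ASCII labels entirely, here is a small offline sketch. It rests on assumptions: the class names obj-left and obj-params-col come from the code above, but the inner tags, the sample values, and the helper last_text are hypothetical, and xml.etree.ElementTree stands in for lxml, so text()[last()] is approximated by reading the tail of the last child.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# Hypothetical listing block; the real page's markup may differ.
MARKUP = (
    u'<div class="obj-left">'
    u'<h3><span><a>Название</a></span></h3>'
    u'<div class="obj-params-col">'
    u'<p><b>Комнат:</b> 2</p>'
    u'<p><b>Площадь:</b> 54 кв. м</p>'
    u'<p><b>Этаж:</b> 3/9</p>'
    u'</div>'
    u'</div>'
)


def last_text(element):
    """Approximate text()[last()]: the text after the last child, else the element's own text."""
    children = list(element)
    if children and children[-1].tail and children[-1].tail.strip():
        return children[-1].tail.strip()
    return (element.text or u'').strip()


root = ET.fromstring(MARKUP.encode('utf-8'))
name = root.find('./h3/span/a').text
details = root.find("./div[@class='obj-params-col']")

# The values are picked by the position of each <p>, never by the Russian label
room, square, floor = [last_text(p) for p in details.findall('./p')[:3]]
```

No Russian string appears in any of the path expressions, so the lookup itself cannot fail on encoding; only printing the results still depends on what the console can display.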

Let us know what you think.
