Scrapy get text spanning multiple lines and within nested elements

Question

I'm trying to scrape Indeed to get information on all the job listings in Bangalore.

URL : https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start=0

XPath for the parent div that I'm interested in:

//div[contains(@class, "jobsearch-SerpJobCard")]

I want to extract the company name, which is sometimes structured like this:


        
              Micro Focus

and sometimes like this:


    
        SSG Software Systems

I'm using a single XPath expression to scrape both kinds of company names. I am having trouble with the second type, as it includes multiple escape characters (such as newlines) that show up in my results and, when stripped, leave an empty string.

XPath used to extract the company names:

//div[contains(@class, "jobsearch-SerpJobCard")]//span[@class="company"]/text()

Result:

[' ', ' ', ' ', ' Client of Analytics Human Capital', ' Advantage Tech', ' ', ' SQUARE', ' DART', ' posmab technologies', ' ', ' PENTAMOUNT TECHNOLOGIES', ' ', '\nMobileComm, Inc.', ' IGLOBAL IMPACT ITES PVT.LTD.', '\n', ' ']

What can I do to get rid of those extra ' ' characters?

stranac · Accepted Answer

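You can use the normalize-space XPath function to achieve this. Applied to the span element itself rather than to its text() nodes, it takes the element's full string value (including text inside nested elements), strips leading and trailing whitespace, and collapses each internal run of whitespace into a single space.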

>>> fetch('https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start=0')
2018-12-15 09:47:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start=0> (referer: None)
>>> response.xpath('//div[contains(@class, "jobsearch-SerpJobCard")]//span[@class="company"]').xpath('normalize-space()').getall()
['Amazon.com', 'Sabre', 'Altisource Labs', 'CGI', 'Allscripts Solutions', 'Shilpin Consulting', 'Access6 technology', 'CGI Group, Inc.', 'Misys Software Solutions India', 'Siemens AG']
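
To see why selecting text() yields whitespace-only strings while normalize-space() on the span does not, here is a small sketch that uses only the Python standard library, not Scrapy. The markup in SAMPLE is hypothetical (the real page's HTML is not reproduced here), and normalize_space is a hand-written helper that imitates the XPath function; neither is part of any library.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup, assumed for illustration only: one company name is
# wrapped in a nested link, the other sits directly inside the span.
SAMPLE = """
<div class="jobsearch-SerpJobCard">
  <span class="company">
    <a href="#">
      Micro Focus</a>
  </span>
  <span class="company">
    SSG Software Systems</span>
</div>
""".strip()


def normalize_space(text):
    """Imitate XPath normalize-space(): collapse whitespace runs, then trim."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


root = ET.fromstring(SAMPLE)
spans = root.findall(".//span[@class='company']")

# span.text holds only the text before the first child element, so for the
# nested case it is nothing but indentation and a newline.
direct_text = [span.text for span in spans]

# itertext() walks all descendant text, which is the element's string value;
# normalizing it gives the clean company name in both cases.
companies = [normalize_space("".join(span.itertext())) for span in spans]

print(direct_text)
print(companies)
```

The first list still contains a whitespace-only entry for the nested case, while the second list contains only the cleaned names. In Scrapy the same effect comes from the chained .xpath('normalize-space()') call shown above, so no helper like this is needed there.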
