AJITH SHENOY

Reputation: 33

Scrapy get text spanning multiple lines and within nested elements

I'm trying to scrape Indeed to get information on all the job listings in Bangalore.

URL : https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start=0

XPath for the parent div that I'm interested in:

//div[contains(@class, "jobsearch-SerpJobCard")]

I want to extract the company name, which is sometimes structured like this:

<span class="company">
        <a>
              Micro Focus
        </a>
</span>

and sometimes like this:

<div>
    <span class="company">
        SSG <b>Software</b> Systems</span>

    </div>

I'm using a single XPath expression to scrape both kinds of company names. I'm having trouble with the second kind: the text is split around the nested element and contains whitespace such as \n, so the results include entries that are just '\n ' and become empty strings once stripped.

XPath used to extract the company names:

//div[contains(@class, "jobsearch-SerpJobCard")]//span[@class="company"]/text()

Result:

['\n ', '\n ', '\n ', '\n Client of Analytics Human Capital', '\n Advantage Tech', '\n ', '\n SQUARE', '\n DART', '\n posmab technologies', '\n ', '\n PENTAMOUNT TECHNOLOGIES', '\n ', '\n MobileComm, Inc.', '\n IGLOBAL IMPACT ITES PVT.LTD.', '\n ', '\n ']

What can I do to get rid of those extra '\n' characters?

Upvotes: 3

Views: 1007

Answers (1)

stranac

Reputation: 28216

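You can use the XPath normalize-space() function for this. Select the span element itself instead of its text() nodes: normalize-space() then works on the element's full string value, including text inside nested elements such as <a> or <b>, strips leading and trailing whitespace, and collapses internal runs of whitespace into a single space.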

>>> fetch('https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start=0')
2018-12-15 09:47:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start=0> (referer: None)
>>> response.xpath('//div[contains(@class, "jobsearch-SerpJobCard")]//span[@class="company"]').xpath('normalize-space()').getall()
['Amazon.com', 'Sabre', 'Altisource Labs', 'CGI', 'Allscripts Solutions', 'Shilpin Consulting', 'Access6 technology', 'CGI Group, Inc.', 'Misys Software Solutions India', 'Siemens AG']
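
To see what normalize-space() is doing without hitting the live site, here is a minimal sketch that approximates it in plain Python on the two markup variants from the question. It does not use Scrapy: it swaps in the standard library's xml.etree.ElementTree, and the helper name normalized_text, the SAMPLE string and its wrapping <root> element are made up for illustration. Note that str.split() treats a few more characters as whitespace than XPath does, so this is an approximation rather than an exact equivalent.

```python
import xml.etree.ElementTree as ET

# The two markup variants from the question, wrapped in a hypothetical
# <root> element so the standard-library XML parser accepts them.
SAMPLE = """
<root>
  <div class="jobsearch-SerpJobCard">
    <span class="company">
        <a>
              Micro Focus
        </a>
    </span>
  </div>
  <div class="jobsearch-SerpJobCard">
    <span class="company">
        SSG <b>Software</b> Systems</span>
  </div>
</root>
"""


def normalized_text(element):
    """Approximate XPath normalize-space() on an element's string value."""
    # itertext() yields the text of the element and of all its descendants,
    # so text inside nested <a> or <b> elements is included.
    full_text = "".join(element.itertext())
    # split() with no arguments drops leading/trailing whitespace and splits
    # on whitespace runs; joining with one space collapses each run.
    return " ".join(full_text.split())


root = ET.fromstring(SAMPLE)
companies = [
    normalized_text(span)
    for span in root.findall(".//span[@class='company']")
]
print(companies)  # ['Micro Focus', 'SSG Software Systems']
```

The key point is the same as in the Scrapy shell session above: work on the whole span rather than on its direct text() children, then normalize the whitespace once.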

Upvotes: 7
