Reputation: 33
I'm trying to scrape Indeed to get information on all the job listings in Bangalore.
URL : https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start=0
XPath for the parent div that I'm interested in:
//div[contains(@class, "jobsearch-SerpJobCard")]
I want to extract the company name, which is sometimes structured like this:
<span class="company">
<a>
Micro Focus
</a>
</span>
and sometimes like this:
<div>
<span class="company">
SSG <b>Software</b> Systems</span>
</div>
I'm using a single XPath expression to scrape both kinds of company names. I'm having trouble with the second kind: the results contain whitespace such as \n, and for some entries stripping it leaves an empty string.
XPath used to extract the company names:
//div[contains(@class, "jobsearch-SerpJobCard")]//span[@class="company"]/text()
Result:
['\n ', '\n ', '\n ', '\n Client of Analytics Human Capital', '\n Advantage Tech', '\n ', '\n SQUARE', '\n DART', '\n posmab technologies', '\n ', '\n PENTAMOUNT TECHNOLOGIES', '\n ', '\n MobileComm, Inc.', '\n IGLOBAL IMPACT ITES PVT.LTD.', '\n ', '\n ']
What can I do to get rid of those extra '\n' characters?
Upvotes: 3
Views: 1007
Reputation: 28216
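You can use the normalize-space XPath function to achieve this. text() selects only the text nodes that are direct children of the span, so a name wrapped in <a> comes back as bare whitespace. normalize-space() instead takes the string value of the whole element, descendants included, trims it, and collapses internal runs of whitespace to a single space.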
>>> fetch('https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start=0')
2018-12-15 09:47:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indeed.co.in/jobs?q=software+developer&l=Bengaluru,+Karnataka&start=0> (referer: None)
>>> response.xpath('//div[contains(@class, "jobsearch-SerpJobCard")]//span[@class="company"]').xpath('normalize-space()').getall()
['Amazon.com', 'Sabre', 'Altisource Labs', 'CGI', 'Allscripts Solutions', 'Shilpin Consulting', 'Access6 technology', 'CGI Group, Inc.', 'Misys Software Solutions India', 'Siemens AG']
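To see offline why text() yields whitespace-only strings while normalize-space() yields clean names, here is a minimal sketch using only the Python standard library. It does not call Scrapy. The SAMPLE markup is a hypothetical reduction of the two structures shown in the question, and the helpers direct_text_nodes and normalize_space are illustrative names that only approximate the XPath behaviour.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup mirroring the two company-name shapes from the question.
SAMPLE = """
<div>
  <div class="jobsearch-SerpJobCard">
    <span class="company">
      <a>
        Micro Focus
      </a>
    </span>
  </div>
  <div class="jobsearch-SerpJobCard">
    <span class="company">
      SSG <b>Software</b> Systems</span>
  </div>
</div>
"""


def direct_text_nodes(element):
    """Approximate `element/text()`: only text directly under the element."""
    nodes = []
    if element.text:
        nodes.append(element.text)
    for child in element:
        if child.tail:
            nodes.append(child.tail)
    return nodes


def normalize_space(element):
    """Approximate XPath normalize-space(.): gather all descendant text,
    trim both ends, and collapse whitespace runs to one space.
    (str.split() treats more characters as whitespace than XPath does.)"""
    return " ".join("".join(element.itertext()).split())


root = ET.fromstring(SAMPLE)
spans = root.findall(".//span[@class='company']")

direct = [direct_text_nodes(span) for span in spans]
companies = [normalize_space(span) for span in spans]

print(direct)     # first span: whitespace only, the name lives inside <a>
print(companies)  # one clean string per span
```

In the first span every direct text node is whitespace, which matches the '\n ' entries in the question's output. In the second span the name is split around the <b> element, so only the whole-element string value recovers it in one piece.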
Upvotes: 7