Reputation: 10383
I downloaded the source of one Indeed page and I'm trying to get all the job titles from it. For that I'm using this XPath:
response.xpath('//*[@class=" row result"]//*[@class="jobtitle"]//text()').extract()
The issue is that each title isn't returned as a single string, so I'm getting this result:
[u'\n ',
u'Data',
u' ',
u'Scientist',
u' Experto SQL con conocimiento en R',
u'\n ',
u'\n ',
u'Data',
u' Analytic con Python',
u'\n ',
u'\n ',
u'Data',
u' Analytic con R',
This is hard to map to the rest of the data. What I want is to select and process the jobs one by one, something similar to extract_first():
response.xpath('//*[@class=" row result"]').extract_first()
But for any given index and with the option to keep processing the data. I tried this:
current_job = response.xpath('//*[@class=" row result"]').extract_first()
current_job = TextResponse(url='',body=current_job,encoding='utf-8')
But it only works for the first result and it doesn't look like a pythonic approach to me.
Upvotes: 1
Views: 736
Reputation: 142651
First I would select only the a elements (without text() and extract()), then loop over them with for, applying text() and extract() to each a separately, and use join() to concatenate the pieces into a single title string.
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://www.indeed.cl/trabajo?q=Data%20scientist&l=']

    def parse(self, response):
        print('url:', response.url)

        # select only the <a> elements and keep them as selectors (no extract() yet)
        results = response.xpath('//h2[@class="jobtitle"]/a')
        print('number:', len(results))

        for item in results:
            # join all text nodes inside this one <a> into a single title
            title = ''.join(item.xpath('.//text()').extract())
            print('title:', title)


# --- it runs without a project; titles are only printed, nothing is saved ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
})
c.crawl(MySpider)
c.start()
Result:
number: 10
title: Data Scientist
title: CONSULTOR DATA SCIENCE SANTIAGO DE CHILE
title: Líder Análisis de Datos MCoE Minerals Americas
title: Ingeniero Inteligencia Mercado, BI
title: Ingeniero Inteligencia de Mercado, Business Intelligence
title: Data Scientist
title: Data Scientist
title: Data Scientist (Machine Learning)
title: Data Scientist / Ml Scientist
title: Young Professional - Spanish LatAm
Upvotes: 2
Reputation: 154
Give this a try. You'll need to adapt the script a little to fit your project, but it should solve the issues you mentioned above.
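import requests
from scrapy import Selector

res = requests.get("https://www.indeed.cl/trabajo?q=Data%20scientist")
# build the selector from the downloaded HTML text
sel = Selector(text=res.text)

for item in sel.css("h2.jobtitle a"):
    # join the text nodes of this one link into a single title
    title = ' '.join(item.css("::text").extract())
    print(title)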
Output:
Data Scientist
CONSULTOR DATA SCIENCE SANTIAGO DE CHILE
Líder Análisis de Datos MCoE Minerals Americas
Ingeniero Inteligencia Mercado, BI
Ingeniero Inteligencia de Mercado, Business Intelligence
Data Scientist
Data Scientist
Young Professional - Spanish LatAm
Data Scientist (Machine Learning)
Data Scientist / Ml Scientist
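One thing to watch when joining with a space: if a link's text nodes already carry their own whitespace, as in the list shown in the question, the joined title can end up with doubled spaces and stray newlines. A small sketch of one way to tidy that up, using only the standard library; clean_title is a hypothetical helper name, and the input is the first six text nodes from the question's output.

```python
def clean_title(parts):
    """Concatenate text nodes and collapse every run of whitespace to one space."""
    return ' '.join(''.join(parts).split())


# First six text nodes from the question's output (assumed to form one title).
parts = ['\n ', 'Data', ' ', 'Scientist', ' Experto SQL con conocimiento en R', '\n ']

naive = ' '.join(parts)       # keeps the newlines and doubles some spaces
cleaned = clean_title(parts)  # one space between words, nothing at the edges

print(repr(naive))
print(repr(cleaned))
```

Calling str.split() with no arguments splits on any whitespace and drops empty pieces, which is why it handles the leading and trailing '\n ' nodes as well.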
Upvotes: 1