Reputation: 75
I'm scraping some data from a website and storing it in the item
dictionary, one entry at a time. How do I store all the data in JSON format in a file, say geramny_startup_jobs.json?
My code is here:
import scrapy
import json
import re
import textwrap
from JobsItem import JobsItem
class GermanyStartupJobs(scrapy.Spider):

    name = 'JobsItem'

    # start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/' + str(i) for i in range(1, 5)]
    start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/']

    def parse(self, response):
        # the endpoint returns JSON; the listing markup sits under the 'html' key
        data = json.loads(response.body)
        html = data['html']
        selector = scrapy.Selector(text=data['html'], type="html")
        hrefs = selector.xpath('//a/@href').extract()

        for href in hrefs:
            yield scrapy.Request(href, callback=self.parse_detail)

    def parse_detail(self, response):
        try:
            full_d = str(response.xpath(
                '//div[@class="col-sm-5 justify-text"]//*/text()').extract())

            full_des_li = full_d.split(',')
            full_des_lis = []
            for f in full_des_li:
                ff = "".join((f.strip().replace('\n', '')).split())
                if len(ff) < 3:
                    continue
                full_des_lis.append(f)

            full = 'u' + str(full_des_lis)
            length = len(full)
            full_des_list = textwrap.wrap(full, length / 3)[:-1]
            full_des_list.reverse()

            # get the job title
            try:
                title = response.css('.job-title').xpath('./text()').extract_first().strip()
            except:
                print "No title"
                title = ''

            # get the company name
            try:
                company_name = response.css('.company-title').xpath('./normal/text()').extract_first().strip()
            except:
                print "No company name"
                company_name = ''

            # get the company location
            try:
                company_location = response.xpath('//a[@class="google_map_link"]/text()').extract_first().strip()
            except:
                print 'No company location'
                company_location = ''

            # get the job poster email (if available)
            try:
                pattern = re.compile(r"(\w(?:[-.+]?\w+)+\@(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)
                for text in full_des_list:
                    email = pattern.findall(text)[-1]
                    if email is not None:
                        break
            except:
                print 'No email'
                email = ''

            # get the job poster phone number (if available)
            try:
                r = re.compile(".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)
                phone = r.findall(full_des_list[0])[-1]
                if phone is not None:
                    phone = '+49-' + phone
            except:
                print 'no phone'
                phone = ''

            # get the name of the poster (if available)
            # get_human_names is a helper I have defined elsewhere
            try:
                for text in full_des_list:
                    names = get_human_names(text)
                    if len(names) != 0:
                        name = names[-1]
                        print name
                        break
            except:
                print 'no name found'
                name = ''

            item = {
                'title': title,
                'company name': company_name,
                'company_location': company_location,
                # 'poster name': name,
                'email': email,
                'phone': phone,
                'source': u"Germany Startup Job"
            }
            yield item
        except:
            print 'Not valid'
            # raise Exception("Think better!!")
I created another file for the item model, following the Scrapy
website, and imported it into the file above.
import scrapy

class JobsItem(scrapy.Item):
    title = scrapy.Field()
    company_name = scrapy.Field()
    company_location = scrapy.Field()
    email = scrapy.Field()
    phone = scrapy.Field()
    source = scrapy.Field()
Then, I ran the command scrapy crawl JobsItem -o geramny_startup_jobs.json,
which doesn't seem to work. I get the output Scrapy 1.2.2 - no active project.
Does that mean I need to create a project to run this command? I don't intend to do that.
Update: I found the command scrapy runspider file_name.py -o item.json,
but it returns the output in an uncleaned format. I still need to get a clean output.
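From the feed exports documentation, it looks like a spider can also carry its own export settings via custom_settings, so that runspider writes the items straight to a file. A minimal sketch of that idea, assuming the FEED_URI and FEED_FORMAT settings of Scrapy 1.x:

import scrapy

class GermanyStartupJobs(scrapy.Spider):
    name = 'JobsItem'
    # per-spider feed-export settings: with these in place, running
    # scrapy runspider file_name.py writes the yielded items to the
    # JSON file without needing a project or the -o flag
    custom_settings = {
        'FEED_URI': 'geramny_startup_jobs.json',
        'FEED_FORMAT': 'json',
    }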
Upvotes: 0
Views: 736
Reputation: 221
You are not using your JobsItem
class in your spider. Replace this code:
item = {
    'title': title,
    'company name': company_name,
    'company_location': company_location,
    # 'poster name': name,
    'email': email,
    'phone': phone,
    'source': u"Germany Startup Job"
}
with this code:
item = JobsItem()
item['title'] = title
item['company_name'] = company_name
item['company_location'] = company_location
item['email'] = email
item['phone'] = phone
item['source'] = u"Germany Startup Job"
This way, your spider returns an item object and not a plain dictionary, which allows Scrapy to write the items to disk when you use the flag -o geramny_startup_jobs.json.
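A side benefit: unlike a plain dict, an Item rejects keys that were never declared as a Field, so typos such as 'company name' instead of company_name surface immediately. A quick sketch of that behaviour (the 'ACME GmbH' value is just illustrative):

item = JobsItem()
item['company_name'] = 'ACME GmbH'   # fine: declared as a Field on JobsItem
item['company name'] = 'ACME GmbH'   # raises KeyError: field not declared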
Upvotes: 1