Chak

Reputation: 75

What is the easiest way to save scraped data into a JSON file?

I'm scraping some data from a website and storing it in the item dictionary one entry at a time. How do I store all of the data in JSON format in, say, a geramny_startup_jobs.json file? My code is here:

import scrapy
import json
import re
import textwrap 
from JobsItem import JobsItem


class GermanyStartupJobs(scrapy.Spider):

    name = 'JobsItem'
    # start_urls= ['https://www.germanystartupjobs.com/jm-ajax/get_listings/' + str(i) for i in range(1, 5)]
    start_urls= ['https://www.germanystartupjobs.com/jm-ajax/get_listings/']

    def parse(self, response):

        data = json.loads(response.body)
        html = data['html']
        selector = scrapy.Selector(text=html, type="html")
        hrefs = selector.xpath('//a/@href').extract()

        for href in hrefs:
            yield scrapy.Request(href, callback=self.parse_detail)


    def parse_detail(self, response):

        try:
            full_d = str(response.xpath(
                '//div[@class="col-sm-5 justify-text"]//*/text()').extract())

            full_des_li = full_d.split(',')
            full_des_lis = []

            for f in full_des_li:
                # strip all whitespace only to measure the fragment;
                # the original fragment is what actually gets kept
                ff = "".join((f.strip().replace('\n', '')).split())
                if len(ff) < 3:
                    continue
                full_des_lis.append(f)

            full = 'u' + str(full_des_lis)

            length = len(full)
            # integer division keeps this working on both Python 2 and 3
            full_des_list = textwrap.wrap(full, length // 3)[:-1]

            full_des_list.reverse()


            # get the job title             
            try:
                title = response.css('.job-title').xpath('./text()').extract_first().strip()
            except:
                print "No title"
                title = ''

            # get the company name
            try:
                company_name = response.css('.company-title').xpath('./normal/text()').extract_first().strip()
            except:
                print "No company name"
                company_name = ''


            # get the company location  
            try:
                company_location = response.xpath('//a[@class="google_map_link"]/text()').extract_first().strip()
            except:
                print('No company location')
                company_location = ''

            # get the job poster email (if available)            
            try:
                pattern = re.compile(r"(\w(?:[-.+]?\w+)+\@(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)

                for text in full_des_list:
                    email = pattern.findall(text)[-1]
                    if email is not None:
                        break   
            except:
                print('No email')
                email = ''

            # get the job poster phone number(if available)                        
            try:
                r = re.compile(".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)
                phone = r.findall(full_des_list[0])[-1]

                if phone is not None:
                    phone = '+49-' + phone

            except:
                print('no phone')
                phone = ''


            # get the name of the poster (if available)
            try:
                for text in full_des_list:
                    # get_human_names is assumed to be defined elsewhere;
                    # it is not imported at the top of this file
                    names = get_human_names(text)
                    if len(names) != 0:
                        name = names[-1]
                        print(name)
                        break
            except:
                print('no name found')
                name = ''

            item = {
                'title': title,
                'company name': company_name,
                'company_location': company_location, 
                # 'poster name': name,
                'email': email,
                'phone': phone,
                'source': u"Germany Startup Job" 
            }
            yield item


        except:
            print('Not valid')
            # raise Exception("Think better!!")

I created another file, modeled on the example from the Scrapy website, and imported it into the file above:

import scrapy

class JobsItem(scrapy.Item):
    title = scrapy.Field()
    company_name = scrapy.Field()
    company_location = scrapy.Field()
    email = scrapy.Field()
    phone = scrapy.Field()
    source = scrapy.Field()
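
One detail worth flagging before switching from the plain dictionary to this class: scrapy.Item only accepts the fields declared on it, and assigning to an undeclared key raises a KeyError. A minimal sketch of that behaviour (the values here are made up):

from JobsItem import JobsItem

item = JobsItem()
item['title'] = 'Backend Developer'   # fine: 'title' is a declared field
item['company name'] = 'ACME GmbH'    # KeyError: the field is declared as 'company_name'

This matters because the dictionary in the spider above uses the key 'company name' (with a space), while the item declares company_name.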

Then, I run the command scrapy crawl JobsItem -o geramny_startup_jobs.json, which doesn't seem to work. I get the output Scrapy 1.2.2 - no active project. Does that mean I need to create a project to run this command? That is something I don't intend to do.

Update: I found the command scrapy runspider file_name.py -o item.json, but it returns the output in an uncleaned format. I still need to get clean output.
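
If "uncleaned" refers to the \uXXXX escape sequences that the JSON exporter produces by default (an assumption about what is meant here), one option is to set FEED_EXPORT_ENCODING through custom_settings on the spider:

class GermanyStartupJobs(scrapy.Spider):

    name = 'JobsItem'
    custom_settings = {
        # Assumes Scrapy >= 1.2, where this setting was introduced;
        # writes real UTF-8 characters instead of \uXXXX escapes.
        'FEED_EXPORT_ENCODING': 'utf-8',
    }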

Upvotes: 0

Views: 736

Answers (1)

Carlos Pe&#241;a
Carlos Pe&#241;a

Reputation: 221

You are not using your JobsItem class in your spider. Replace this code:

item = {
    'title': title,
    'company name': company_name,
    'company_location': company_location, 
    # 'poster name': name,
    'email': email,
    'phone': phone,
    'source': u"Germany Startup Job" 
}

with this code:

item = JobsItem()
item['title'] = title
item['company_name'] = company_name
item['company_location'] = company_location
item['email'] = email
item['phone'] = phone
item['source'] = u"Germany Startup Job" 

This way, your spider returns an item instance rather than a plain dictionary, which will allow Scrapy to write the items to disk when you use the flag -o geramny_startup_jobs.json. Keep the existing yield item line after this block, otherwise nothing reaches the exporter.
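
For completeness, here is a trimmed-down sketch of the whole arrangement (the file name and most of the parsing are placeholders, not the full spider): defining JobsItem in the same file as the spider sidesteps the import problem entirely when running without a project.

import json
import scrapy

class JobsItem(scrapy.Item):
    title = scrapy.Field()
    source = scrapy.Field()

class GermanyStartupJobs(scrapy.Spider):

    name = 'germany_startup_jobs'
    start_urls = ['https://www.germanystartupjobs.com/jm-ajax/get_listings/']

    def parse(self, response):
        # the endpoint returns JSON whose 'html' key holds the listing markup
        data = json.loads(response.body)
        selector = scrapy.Selector(text=data['html'], type="html")
        for href in selector.xpath('//a/@href').extract():
            yield scrapy.Request(href, callback=self.parse_detail)

    def parse_detail(self, response):
        item = JobsItem()
        item['title'] = (response.css('.job-title::text').extract_first() or '').strip()
        item['source'] = u"Germany Startup Job"
        yield item

Saved as, say, spider_file.py, this runs with scrapy runspider spider_file.py -o geramny_startup_jobs.json.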

Upvotes: 1
