Laurie

Reputation: 1229

Python - Scrapy to Json Output Splitting

I'm learning about NLP, and to do this I'm scraping Amazon book reviews using Scrapy. I've extracted the fields that I want and am outputting them to a JSON file. When this file is loaded as a DataFrame, each field is recorded as one long list rather than one value per row. How can I split these lists so that the DataFrame has a row for each review, rather than all entries being recorded in separate lists? Code:

import scrapy


class ReviewspiderSpider(scrapy.Spider):
    name = 'reviewspider'
    allowed_domains = ['amazon.com']
    start_urls = ['https://www.amazon.com/Gone-Girl-Gillian-Flynn/product-reviews/0307588378/ref=cm_cr_othr_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1']

    def parse(self, response):
        users = response.xpath('//a[contains(@data-hook, "review-author")]/text()').extract()
        titles = response.xpath('//a[contains(@data-hook, "review-title")]/text()').extract()
        dates = response.xpath('//span[contains(@data-hook, "review-date")]/text()').extract()
        found_helpful = response.xpath('//span[contains(@data-hook, "helpful-vote-statement")]/text()').extract()
        rating = response.xpath('//i[contains(@data-hook, "review-star-rating")]/span[contains(@class, "a-icon-alt")]/text()').extract()
        content = response.xpath('//span[contains(@data-hook, "review-body")]/text()').extract()

        yield {
            'users': users,
            'titles': titles,
            'dates': dates,
            'found_helpful': found_helpful,
            'rating': rating,
            'content': content
        }

Sample Output:

users = ['Lauren', 'James'...'John']
dates = ['on September 28, 2017', 'on December 26, 2017'...'on November 17, 2016']
rating = ['5.0 out of 5 stars', '2.0 out of 5 stars'...'5.0 out of 5 stars']

Desired Output:

index 1: [users='Lauren', dates='on September 28, 2017', rating='5.0 out of 5 stars']
index 2: [users='James', dates='on December 26, 2017', rating='2.0 out of 5 stars']
...

I know that the item pipeline associated with the spider should probably be edited to achieve this, but I have limited Python knowledge and couldn't follow the Scrapy documentation. I've also tried the solutions from here and here, but I don't know enough to consolidate those answers with my own code. Any help would be much appreciated.

Upvotes: 0

Views: 1068

Answers (2)

Laurie

Reputation: 1229

EDIT: I was able to come up with a solution by using the .css method instead of .xpath, looping over each product container and yielding one item per product. Here is the spider I used to scrape shirt listings from a fashion retailer:

import scrapy
from ..items import ProductItem

class SportsdirectSpider(scrapy.Spider):
    name = 'sportsdirect'
    allowed_domains = ['www.sportsdirect.com']
    start_urls = ['https://www.sportsdirect.com/mens/mens-shirts']

    def parse(self, response):
        products = response.css('.s-productthumbbox')
        for p in products:
            brand = p.css('.productdescriptionbrand::text').extract_first()
            name = p.css('.productdescriptionname::text').extract_first()
            price = p.css('.curprice::text').extract_first()
            item = ProductItem()
            item['brand'] = brand
            item['name'] = name
            item['price'] = price
            yield item

The related items.py script:

import scrapy

class ProductItem(scrapy.Item):
    brand = scrapy.Field()
    name = scrapy.Field()
    price = scrapy.Field()

Creation of a json-lines file (in Anaconda prompt):

cd simple_crawler
scrapy crawl sportsdirect --set FEED_URI=products.jl
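An equivalent, slightly shorter invocation should also work: Scrapy's -o output flag writes a feed and infers the JSON Lines format from the .jl extension.

cd simple_crawler
scrapy crawl sportsdirect -o products.jl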

The code used to turn the created .jl file into a dataframe:

import json
import pandas as pd

# each line of the .jl file is one JSON object, i.e. one scraped item
with open('products.jl', 'r') as f:
    data = [json.loads(line) for line in f.read().strip().split('\n')]

df2 = pd.DataFrame(data)

Final output:

        brand        name                        price
0   Pierre Cardin    Short Sleeve Shirt Mens     £6.50 
1   Pierre Cardin    Short Sleeve Shirt Mens     £7.00 
...
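As a side note, pandas can also read JSON Lines files directly, which should give the same DataFrame without the manual json.loads loop:

import pandas as pd

# lines=True parses one JSON object per line of the feed file
df2 = pd.read_json('products.jl', lines=True)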

Upvotes: 0

Iguananaut

Reputation: 23306

After re-reading your question I'm pretty sure this is what you want:

def parse(self, response):
    users = response.xpath('//a[contains(@data-hook, "review-author")]/text()').extract()
    titles = response.xpath('//a[contains(@data-hook, "review-title")]/text()').extract()
    dates = response.xpath('//span[contains(@data-hook, "review-date")]/text()').extract()
    found_helpful = response.xpath('//span[contains(@data-hook, "helpful-vote-statement")]/text()').extract()
    rating = response.xpath('//i[contains(@data-hook, "review-star-rating")]/span[contains(@class, "a-icon-alt")]/text()').extract()
    content = response.xpath('//span[contains(@data-hook, "review-body")]/text()').extract()

    # zip() pairs up the i-th element of each list, so one dict is yielded per review
    for user, title, date, found_helpful, rating, content in zip(users, titles, dates, found_helpful, rating, content):
        yield {
            'user': user,
            'title': title,
            'date': date,
            'found_helpful': found_helpful,
            'rating': rating,
            'content': content
        }

or something to that effect. That's what I was trying to hint at in my first comment.
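One caveat with zipping parallel lists: if a field is missing for some reviews (for example, not every review has a helpful-vote statement), the lists end up with different lengths and zip will quietly pair values from different reviews. A more defensive variant, in the same spirit as the per-product loop in the .css answer above, iterates over each review container instead. This is only a sketch: the //div[@data-hook="review"] container selector is an assumption about Amazon's markup and may need adjusting.

def parse(self, response):
    # iterate over each review block so the fields stay aligned per review
    # (the "review" data-hook container is assumed; verify it against the page source)
    for review in response.xpath('//div[@data-hook="review"]'):
        yield {
            'user': review.xpath('.//a[contains(@data-hook, "review-author")]/text()').extract_first(),
            'title': review.xpath('.//a[contains(@data-hook, "review-title")]/text()').extract_first(),
            'date': review.xpath('.//span[contains(@data-hook, "review-date")]/text()').extract_first(),
            'found_helpful': review.xpath('.//span[contains(@data-hook, "helpful-vote-statement")]/text()').extract_first(),
            'rating': review.xpath('.//i[contains(@data-hook, "review-star-rating")]/span[contains(@class, "a-icon-alt")]/text()').extract_first(),
            'content': review.xpath('.//span[contains(@data-hook, "review-body")]/text()').extract_first(),
        }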

Upvotes: 1
