Reputation: 1229
I'm learning about NLP, and to get some data I'm scraping Amazon book reviews with Scrapy. I've extracted the fields I want and am writing them to a JSON file. When this file is loaded as a DataFrame, each field is stored as a single list holding every value on the page, rather than as one value per row. How can I split these lists so that the DataFrame has one row per review, instead of all the values for a field being stored together in separate lists? Code:
import scrapy


class ReviewspiderSpider(scrapy.Spider):
    name = 'reviewspider'
    allowed_domains = ['amazon.co.uk']
    start_urls = ['https://www.amazon.com/Gone-Girl-Gillian-Flynn/product-reviews/0307588378/ref=cm_cr_othr_d_paging_btm_1?ie=UTF8&reviewerType=all_reviews&pageNumber=1']

    def parse(self, response):
        # Each XPath matches every review on the page, so .extract() returns a list of strings.
        users = response.xpath('//a[contains(@data-hook, "review-author")]/text()').extract()
        titles = response.xpath('//a[contains(@data-hook, "review-title")]/text()').extract()
        dates = response.xpath('//span[contains(@data-hook, "review-date")]/text()').extract()
        found_helpful = response.xpath('//span[contains(@data-hook, "helpful-vote-statement")]/text()').extract()
        rating = response.xpath('//i[contains(@data-hook, "review-star-rating")]/span[contains(@class, "a-icon-alt")]/text()').extract()
        content = response.xpath('//span[contains(@data-hook, "review-body")]/text()').extract()

        # A single item is yielded per page, and each of its values is a whole list.
        yield {
            'users': users,
            'titles': titles,
            'dates': dates,
            'found_helpful': found_helpful,
            'rating': rating,
            'content': content,
        }
Sample Output:
users = ['Lauren', 'James'...'John']
dates = ['on September 28, 2017', 'on December 26, 2017'...'on November 17, 2016']
rating = ['5.0 out of 5 stars', '2.0 out of 5 stars'...'5.0 out of 5 stars']
Desired Output:
index 1: [users='Lauren', dates='on September 28, 2017', rating='5.0 out of 5 stars']
index 2: [users='James', dates='on December 26, 2017', rating='2.0 out of 5 stars']
...
I know that the pipeline related to the spider should probably be edited to achieve this, but I have limited Python knowledge and couldn't follow the Scrapy documentation. I've also tried the solutions from here and here, but I don't know enough to reconcile those answers with my own code. Any help would be much appreciated.
Upvotes: 0
Views: 1068
Reputation: 1229
EDIT: I was able to come up with a solution by using the .css method instead of .xpath. This is the spider I used to scrape shirt listings from a fashion retailer:
import scrapy
from ..items import ProductItem


class SportsdirectSpider(scrapy.Spider):
    name = 'sportsdirect'
    allowed_domains = ['www.sportsdirect.com']
    start_urls = ['https://www.sportsdirect.com/mens/mens-shirts']

    def parse(self, response):
        # One selector per product box; the fields below are queried relative to it.
        products = response.css('.s-productthumbbox')
        for p in products:
            brand = p.css('.productdescriptionbrand::text').extract_first()
            name = p.css('.productdescriptionname::text').extract_first()
            price = p.css('.curprice::text').extract_first()
            item = ProductItem()
            item['brand'] = brand
            item['name'] = name
            item['price'] = price
            yield item
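What makes this work is probably less the choice of `.css` over `.xpath` than the loop over one container per product, with every field queried relative to that container. A field that is missing for one product then comes back as `None` for that product instead of shifting the values of the others. The sketch below illustrates the idea with the standard library's `xml.etree.ElementTree` on a made-up, well-formed snippet rather than with Scrapy selectors; the markup, the brand placeholders and the `extract_rows` helper are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Made-up markup (not the real page): the second product has no price element.
PAGE = """
<div>
  <div class="product"><span class="brand">Pierre Cardin</span><span class="price">6.50</span></div>
  <div class="product"><span class="brand">Brand B</span></div>
  <div class="product"><span class="brand">Brand C</span><span class="price">9.00</span></div>
</div>
"""


def extract_rows(page):
    """Yield one dict per product container, querying fields relative to it."""
    root = ET.fromstring(page)
    for product in root.findall(".//div[@class='product']"):
        # findtext returns None when the element is absent in this container.
        yield {
            'brand': product.findtext("span[@class='brand']"),
            'price': product.findtext("span[@class='price']"),
        }


rows = list(extract_rows(PAGE))

# Page-wide extraction, by contrast, produces lists of different lengths,
# so the position in the list no longer identifies the product.
root = ET.fromstring(PAGE)
all_brands = [e.text for e in root.findall(".//span[@class='brand']")]
all_prices = [e.text for e in root.findall(".//span[@class='price']")]

print(len(rows), len(all_brands), len(all_prices))
```

With page-wide lists, the third product's price would end up in second position; the per-container loop keeps it attached to the right brand.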
The related items.py script:
import scrapy


class ProductItem(scrapy.Item):
    # One Field per key that the spider sets on the item.
    brand = scrapy.Field()
    name = scrapy.Field()
    price = scrapy.Field()
Creation of a JSON Lines file (in Anaconda Prompt):
cd simple_crawler
scrapy crawl sportsdirect --set FEED_URI=products.jl
The code used to turn the created .jl file into a dataframe:
import json
import pandas as pd

# Each line of a .jl (JSON Lines) file is one JSON object, i.e. one scraped item.
with open('products.jl', 'r') as f:
    data = [json.loads(line) for line in f if line.strip()]

df2 = pd.DataFrame(data)
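A JSON Lines file holds one JSON object per line, which is why each line ends up as one DataFrame row. The sketch below shows just that parsing step on an in-memory sample, using only the standard library; the sample records and the `parse_json_lines` name are made up for illustration. pandas can also read such a file directly with `pd.read_json(path, lines=True)`.

```python
import io
import json


def parse_json_lines(stream):
    """Return one dict per non-empty line of a JSON Lines stream."""
    return [json.loads(line) for line in stream if line.strip()]


# Hypothetical sample standing in for the exported .jl file,
# including a trailing blank line that should be skipped.
sample = io.StringIO(
    '{"brand": "Pierre Cardin", "name": "Short Sleeve Shirt Mens", "price": "£6.50"}\n'
    '{"brand": "Pierre Cardin", "name": "Short Sleeve Shirt Mens", "price": "£7.00"}\n'
    '\n'
)

records = parse_json_lines(sample)
print(f"{len(records)} records parsed")
```

Passing `records` to `pd.DataFrame` would then give one row per record, with the dictionary keys as columns.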
Final output:
           brand                     name  price
0  Pierre Cardin  Short Sleeve Shirt Mens  £6.50
1  Pierre Cardin  Short Sleeve Shirt Mens  £7.00
...
Upvotes: 0
Reputation: 23306
After re-reading your question, I'm pretty sure this is what you want:
def parse(self, response):
    users = response.xpath('//a[contains(@data-hook, "review-author")]/text()').extract()
    titles = response.xpath('//a[contains(@data-hook, "review-title")]/text()').extract()
    dates = response.xpath('//span[contains(@data-hook, "review-date")]/text()').extract()
    found_helpful = response.xpath('//span[contains(@data-hook, "helpful-vote-statement")]/text()').extract()
    rating = response.xpath('//i[contains(@data-hook, "review-star-rating")]/span[contains(@class, "a-icon-alt")]/text()').extract()
    content = response.xpath('//span[contains(@data-hook, "review-body")]/text()').extract()

    # zip pairs up the i-th entry of every list, so each iteration is one review.
    # The loop variables have their own names so they don't shadow the lists.
    for user, title, date, helpful, stars, body in zip(users, titles, dates, found_helpful, rating, content):
        yield {
            'user': user,
            'title': title,
            'date': date,
            'found_helpful': helpful,
            'rating': stars,
            'content': body,
        }
or something to that effect. That's what I was trying to hint at in my first comment.
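To see what the `zip` loop does independently of Scrapy, here is a minimal sketch in plain Python. The helper name `columns_to_rows` and the sample values are illustrative, not part of the spider. It raises an error if the lists differ in length, because `zip` silently stops at the shortest list; that could drop or misalign rows if, for example, some reviews had no helpful-vote text (an assumption about the page, not something confirmed here).

```python
def columns_to_rows(columns):
    """Turn {'field': [v1, v2, ...], ...} into [{'field': v1, ...}, {'field': v2, ...}]."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # zip would truncate to the shortest list, so fail loudly instead.
        raise ValueError(f"columns have different lengths: {lengths}")
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Illustrative values in the shape of the question's sample output.
scraped = {
    'user': ['Lauren', 'James'],
    'date': ['on September 28, 2017', 'on December 26, 2017'],
    'rating': ['5.0 out of 5 stars', '2.0 out of 5 stars'],
}

rows = columns_to_rows(scraped)
for row in rows:
    print(row)
```

Each element of `rows` corresponds to one review, which is the shape a DataFrame needs to put one review per row.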
Upvotes: 1