detroxd

Reputation: 3

Scrapy spider is repeating scraped data

from scrapy.spiders import Spider
from ..items import QtItem

class QuoteSpider(Spider):
    name = 'acres'
    start_urls = ['any_url']

def parse(self, response):
    items = QtItem()

    all_div_names = response.xpath('//article')

    for bks in all_div_names:
        name = all_div_names.xpath('//span[@class="css-fwbz9r"]/text()').extract()
        price = all_div_names.xpath('//h2[@class="css-yr18fa"]/text()').extract()
        sqft = all_div_names.xpath('//div[@class="css-1ty8tu4"]/text()').extract()
        bhk = all_div_names.xpath('//a[@class="css-163eyf0"]/text()').extract()

    yield {
        'ttname': name,
        'ttprice': price,
        'ttsqft': sqft,
        'ttbhk': bhk
    }

the question has been answered

Upvotes: 0

Views: 67

Answers (2)

AaronS

Reputation: 2335

Corrections

  1. Use .// instead of // in every XPath expression inside the loop, so each query is relative to the current node.
  2. Call xpath() on the loop variable bks, not on all_div_names.
  3. Use get() instead of extract(), since each field is a single value within the article. get() returns the first match as a string; extract() on a list of selectors returns a list of all matches.
  4. Your yield statement is not inside the for loop. To yield one dictionary per article, the yield statement needs to be within the for loop.

eg. name = bks.xpath('.//span[@class="css-fwbz9r"]/text()').get()

Tips

  1. .// searches only the descendants of the current node, whereas // searches the whole document no matter which selector you call it on. Always use .// when you're looping over an XPath selector that holds multiple nodes, such as all_div_names. e.g. name = bks.xpath('.//span[@class="css-fwbz9r"]/text()').get() selects only the span elements inside bks.
  2. Use getall() instead of extract() and get() instead of extract_first(). get() always returns a single string (or None if nothing matches) and getall() always returns a list, whereas extract() returns a list or a string depending on what you call it on.
  3. Use an Item rather than yielding a plain dictionary. It makes things like pipelines easier. A pipeline modifies or filters data after it is scraped, e.g. changing which Items are written to a JSON file. A common one is a duplicates pipeline, an example of which can be found in the Scrapy docs; it lets you drop an item if it duplicates data already seen. I almost never yield a dictionary for scraping projects unless the data is highly structured and needs no modification, or no duplicate information is extracted.
  4. Consider using Scrapy's ItemLoaders for any scraping project where the extracted data needs simple modification, e.g. stripping newlines or tweaking values slightly. You'll be surprised how often this is.

Code Example

def parse(self, response):
    items = QtItem()

    all_div_names = response.xpath('//article')

    for bks in all_div_names:
        name = bks.xpath('.//span[@class="css-fwbz9r"]/text()').get()
        price = bks.xpath('.//h2[@class="css-yr18fa"]/text()').get()
        sqft = bks.xpath('.//div[@class="css-1ty8tu4"]/text()').get()
        bhk = bks.xpath('.//a[@class="css-163eyf0"]/text()').get()

        yield {
            'ttname': name,
            'ttprice': price,
            'ttsqft': sqft,
            'ttbhk': bhk
        }
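
To see why the position of yield matters, here is a small sketch in plain Python with no Scrapy involved. The function names and sample rows are made up for illustration.

```python
def yield_after_loop(rows):
    # The loop overwrites `record` on every pass;
    # only the final value survives to be yielded once.
    for row in rows:
        record = {"name": row}
    yield record


def yield_inside_loop(rows):
    # One record is yielded per row.
    for row in rows:
        record = {"name": row}
        yield record


rows = ["Flat A", "Flat B", "Flat C"]
after = list(yield_after_loop(rows))
inside = list(yield_inside_loop(rows))

print(after)   # [{'name': 'Flat C'}]
print(inside)  # [{'name': 'Flat A'}, {'name': 'Flat B'}, {'name': 'Flat C'}]
```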

Upvotes: 0

Samsul Islam

Reputation: 2619

You use a for loop but never use the loop variable 'bks'.

    for bks in all_div_names:
        name = bks.xpath('//span[@class="css-fwbz9r"]/text()').extract()
        price = bks.xpath('//h2[@class="css-yr18fa"]/text()').extract()
        sqft = bks.xpath('//div[@class="css-1ty8tu4"]/text()').extract()
        bhk = bks.xpath('//a[@class="css-163eyf0"]/text()').extract()

Here is our output.

{'ttname': ['Jodhpur Village, Jodhpur, Ahmedabad', 'Shapers Swastik Platinum, Narolgam, Ahmedabad', 'Gayatri Maitri Lake View, Zundal, Ahmedabad', 'Puspak Platinum , Ambli, Ahmedabad', 'arjun greens, Naranpura, Ahmedabad', 'Aariyana Lakeside, Shilaj, Ahmedabad', 'Ganesh Malabar County II, Chharodi, Ahmedabad', 'Jodhpur Village, Jodhpur, Ahmedabad', 'Ratna Paradise, Khoraj, Ahmedabad', 'Thaltej, Ahmedabad', 'Binori Solitaire, Bopal, Ahmedabad', 'Arvind & Safal Parishkaar Apartments, Amraiwadi, Ahmedabad', 'Siddhivinayak Omkar Lotus, Chandkheda, Ahmedabad', 'Orchid Whitefield , Prahlad Nagar, Ahmedabad', 'VISHWAS CITY , Gota, Ahmedabad', 'Gala Aria, Bopal, Ahmedabad', 'Ganesh Malabar County, Chharodi, Ahmedabad', 'Devnandan Infinity , Motera, Ahmedabad', 'Sapphire Swapneel Elysium, Bopal, Ahmedabad', 'Veer Mahavir Hills 2, Koba, Ahmedabad'],
 'ttprice': ['₹95.0 L', '₹17.0 L', '₹28.75 L', '₹1.4 Cr', '₹1.0 Cr', '₹3.5 Cr', '₹43.0 L', '₹47.5 L', '₹1.55 Cr', '₹65.0 L', '₹1.1 Cr', '₹42.0 L', '₹74.0 L', '₹50.0 L', '₹30.0 L', '₹1.18 Cr', '₹47.0 L', '₹50.0 L', '₹81.0 L', '₹33.0 L'],
 'ttsqft': ['1750 sq.ft', '₹5.43 K/sq.ft', '870 sq.ft', '₹1.95 K/sq.ft', '1125 sq.ft', '₹2.56 K/sq.ft', '2250 sq.ft', '₹6.22 K/sq.ft', '1812 sq.ft', '₹5.52 K/sq.ft', '4275 sq.ft', '₹8.19 K/sq.ft', '1170 sq.ft', '₹3.67 K/sq.ft', '1200 sq.ft', '₹3.96 K/sq.ft', '3340 sq.ft', '₹4.64 K/sq.ft', '1710 sq.ft', '₹3.80 K/sq.ft', '2214 sq.ft', '₹4.97 K/sq.ft', '1108 sq.ft', '₹3.79 K/sq.ft', '1960 sq.ft', '₹3.77 K/sq.ft', '1050 sq.ft', '₹4.76 K/sq.ft', '954 sq.ft', '₹3.14 K/sq.ft', '2115 sq.ft', '₹5.58 K/sq.ft', '1168 sq.ft', '₹4.02 K/sq.ft', '1323 sq.ft', '₹3.78 K/sq.ft', '1800 sq.ft', '₹4.50 K/sq.ft', '1215 sq.ft', '₹2.72 K/sq.ft'],
 'ttbhk': ['3 BHK Apartment', '2 BHK Apartment', '2 BHK Apartment', '3 BHK Apartment', '3 BHK Apartment', '4 BHK Apartment', '2 BHK Apartment', '2 BHK Apartment', '4 BHK Apartment', '3 BHK Apartment', '3 BHK Apartment', '2 BHK Apartment', '3 BHK Apartment', '2 BHK Apartment', '2 BHK Apartment', '3 BHK Apartment', '2 BHK Apartment', '2 BHK Apartment', '3 BHK Apartment', '2 BHK Apartment'],

Upvotes: 1
