Keenan Burke-Pitts
Keenan Burke-Pitts

Reputation: 475

Unexpected Behavior When Scraping Pages with Scrapy

I am struggling to get my code to extract out data from each json object of the current request call and then once it goes through each json object, move on to the next request for the next batch of json objects. It appears that my script just scrapes the first request call over and over. Can someone assist me what I'm missing in my for loop and/or while loop? Thanks in advance!!

import scrapy
import json
import requests
import re
from time import sleep
import sys

class LetgoSpider(scrapy.Spider):
    name = 'letgo'
    allowed_domains = ['letgo.com/en', 'search-products-pwa.letgo.com']
    start_urls = ['https://search-products-pwa.letgo.com/api/products?country_code=US&offset=0&quadkey=0320030123201&num_results=50&distance_radius=50&distance_type=mi']
    offset = 0

    def parse(self, response):
      data = json.loads(response.text)
      if len(data) == 0:
        sys.exit()
      else:
        for used_item in data:
              try:
                  if used_item['name'] == None:
                      title = used_item['image_information']
                  else:
                      title = used_item['name']
                  id_number = used_item['id']
                  price = used_item['price']
                  description = used_item['description']
                  date = used_item['updated_at']
                  images = [img['url'] for img in used_item['images']]
                  latitude = used_item['geo']['lat']
                  longitude = used_item['geo']['lng']
                  link = 'https://us.letgo.com/en/i/' + re.sub(r'\W+', '-', title) + '_' + id_number
                  location = used_item['geo']['city']
              except:
                  pass

              yield {'Title': title,
                      'Url': link,
                      'Price': price,
                      'Description': description,
                      'Date': date,
                      'Images': images,
                      'Latitude': latitude,
                      'Longitude': longitude,
                      'Location': location,
                      }    

      self.offset += 50
      new_request = 'https://search-products-pwa.letgo.com/api/products?country_code=US&offset=' + str(self.offset) + \
                      '&quadkey=0320030123201&num_results=50&distance_radius=50&distance_type=mi'
      print('new request is: ' + new_request)
      sleep(1)
      yield scrapy.Request(new_request, callback=self.parse)

Upvotes: 0

Views: 131

Answers (1)

Nour Wolf
Nour Wolf

Reputation: 2180

Try running this code. I have only cleaned it a bit.

import json
import re

import scrapy


class LetgoSpider(scrapy.Spider):
    name = 'letgo'
    allowed_domains = ['letgo.com/en', 'search-products-pwa.letgo.com']
    search_url = 'https://search-products-pwa.letgo.com/api/products' \
                 '?country_code=US' \
                 '&offset={offset}' \
                 '&quadkey=0320030123201' \
                 '&num_results={num_results}' \
                 '&distance_radius=50' \
                 '&distance_type=mi'
    offset = 0
    num_results = 5
    max_pages = 3
    start_urls = [
        search_url.format(offset=offset, num_results=num_results)
    ]
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1',
        'LOG_LEVEL': 'INFO',
    }

    def parse(self, response):
        data = json.loads(response.text)
        for used_item in data:
            try:
                title = used_item['name'] or used_item['image_information']
                id_number = used_item['id']
                price = used_item['price']
                description = used_item['description']
                date = used_item['updated_at']
                images = [img['url'] for img in used_item['images']]
                latitude = used_item['geo']['lat']
                longitude = used_item['geo']['lng']
                link = 'https://us.letgo.com/en/i/' + re.sub(r'\W+', '-', title) + '_' + id_number
                location = used_item['geo']['city']
            except KeyError:
                pass
            else:
                item = {
                    'Title': title,
                    'Url': link,
                    'Price': price,
                    'Description': description,
                    'Date': date,
                    'Images': images,
                    'Latitude': latitude,
                    'Longitude': longitude,
                    'Location': location,
                }
                print(item)
                yield item

        self.offset += self.num_results
        if self.offset > self.num_results * self.max_pages:
            return

        next_page_url = self.search_url.format(offset=self.offset, num_results=self.num_results)
        yield scrapy.Request(url=next_page_url, callback=self.parse)

Here are the logs when running it

/Volumes/Dev/miniconda3/envs/scm/bin/python -m scrapy runspider sc.py
2018-02-22 00:46:23 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-02-22 00:46:23 [scrapy.utils.log] INFO: Versions: lxml 4.1.1.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:14:59) - [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0g  2 Nov 2017), cryptography 2.1.4, Platform Darwin-17.4.0-x86_64-i386-64bit
2018-02-22 00:46:23 [scrapy.crawler] INFO: Overridden settings: {'LOG_LEVEL': 'INFO', 'SPIDER_LOADER_WARN_ONLY': True, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'}
2018-02-22 00:46:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2018-02-22 00:46:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-02-22 00:46:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-02-22 00:46:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-02-22 00:46:23 [scrapy.core.engine] INFO: Spider opened
2018-02-22 00:46:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
{'Title': '54 Inch Light Bar', 'Url': 'https://us.letgo.com/en/i/54-Inch-Light-Bar_fbe7f2b2-29b4-4a39-a1c6-77e8fde56ab5', 'Price': 80, 'Description': '54 Inch Light Bar...New never been installed...Call or Text  [TL_HIDDEN] ', 'Date': '2018-02-21T23:38:46+00:00', 'Images': ['https://img.letgo.com/images/72/94/6c/90/72946c90a739a4710ca709af1e87ffca.jpeg'], 'Latitude': 35.5362217, 'Longitude': -82.8092321, 'Location': 'Canton'}
{'Title': 'Jr Tour Golf Clubs', 'Url': 'https://us.letgo.com/en/i/Jr-Tour-Golf-Clubs_40324f63-3b18-401a-bdad-900d58fa9be1', 'Price': 40, 'Description': 'Right handed golf clubs ', 'Date': '2018-02-21T23:38:20+00:00', 'Images': ['https://img.letgo.com/images/33/8a/cf/6f/338acf6fc7959626683fbe857480e9a9.jpeg', 'https://img.letgo.com/images/60/7d/37/b1/607d37b1281fce2b48a045398d49ff4c.jpeg', 'https://img.letgo.com/images/ae/de/60/b1/aede60b1260124bfdbacbc7a9aaf25c8.jpeg', 'https://img.letgo.com/images/f0/3e/2c/03/f03e2c031e1976986e25f9f12b1ddd20.jpeg'], 'Latitude': 35.657392629984, 'Longitude': -82.705151547089, 'Location': 'Leicester'}
{'Title': 'Glass vase', 'Url': 'https://us.letgo.com/en/i/Glass-vase_ebaad5f6-afc0-42cb-99b2-aae9ce0cec31', 'Price': 80, 'Description': '', 'Date': '2018-02-21T23:37:20+00:00', 'Images': ['https://img.letgo.com/images/97/fa/68/82/97fa6882b38be80a6084ffa605a94fae.jpeg', 'https://img.letgo.com/images/68/35/a5/d6/6835a5d65f8443abe12e1afa69eb75cd.jpeg'], 'Latitude': 35.580766432121, 'Longitude': -82.622580964386, 'Location': 'Asheville'}
{'Title': "women's pink and black polka-dot long-sleeved top", 'Url': 'https://us.letgo.com/en/i/women-s-pink-and-black-polka-dot-long-sleeved-top_d33d05a3-a362-487d-af3c-10f70c1edc54', 'Price': 2, 'Description': '18 months ', 'Date': '2018-02-21T23:37:01+00:00', 'Images': ['https://img.letgo.com/images/87/e4/44/21/87e44421d0bae79bce09424b39ad9bd8.jpeg'], 'Latitude': 35.5135800231, 'Longitude': -82.68708409485, 'Location': 'Candler'}
{'Title': 'yellow and black DeWalt power tool kit set', 'Url': 'https://us.letgo.com/en/i/yellow-and-black-DeWalt-power-tool-kit-set_45a070fc-8d45-479d-8453-0d52e899423a', 'Price': 115, 'Description': '110-115. I have a bag to fit it all for a I total of 130', 'Date': '2018-02-21T23:36:12+00:00', 'Images': ['https://img.letgo.com/images/bc/2f/69/71/bc2f6971e2891e9bb80205ba03d6c209.jpeg', 'https://img.letgo.com/images/0d/4c/0c/f2/0d4c0cf2536c29320fdd7fffa05cb242.jpeg', 'https://img.letgo.com/images/53/0e/97/78/530e9778c5e5266eaad92afa6ccb0405.jpeg', 'https://img.letgo.com/images/58/93/62/05/58936205711631e148bd5a17cf5d8d14.jpeg'], 'Latitude': 35.580774319984, 'Longitude': -82.62263189396, 'Location': 'Asheville'}
{'Title': "girl's gray and white Calvin Klein sweater", 'Url': 'https://us.letgo.com/en/i/girl-s-gray-and-white-Calvin-Klein-sweater_2ee6a5dd-bec7-4a0b-a575-38ceacebc193', 'Price': 3, 'Description': '12 months ', 'Date': '2018-02-21T23:36:11+00:00', 'Images': ['https://img.letgo.com/images/19/a4/83/0d/19a4830dc0fcc598218ba2ad49566dcf.jpeg'], 'Latitude': 35.513783889312, 'Longitude': -82.686794813796, 'Location': 'Candler'}
{'Title': "toddler's blue, pink, and white floral embellished denim bib overalls", 'Url': 'https://us.letgo.com/en/i/toddler-s-blue-pink-and-white-floral-embellished-denim-bib-overalls_6551c032-0de2-4b25-b4d6-29e39860d0cc', 'Price': 5, 'Description': '18 months ', 'Date': '2018-02-21T23:35:38+00:00', 'Images': ['https://img.letgo.com/images/2d/d3/84/3a/2dd3843a82031d3c88f96822d5dbff3c.jpeg'], 'Latitude': 35.513783889312, 'Longitude': -82.686794813796, 'Location': 'Candler'}
{'Title': 'red and black dog print pajama set', 'Url': 'https://us.letgo.com/en/i/red-and-black-dog-print-pajama-set_8020d458-b135-4d3e-a057-bb559a85156a', 'Price': 5, 'Description': '18 months ', 'Date': '2018-02-21T23:35:10+00:00', 'Images': ['https://img.letgo.com/images/14/ee/c5/c3/14eec5c3b94337050766c5dd4932b2cb.jpeg'], 'Latitude': 35.513783889312, 'Longitude': -82.686794813796, 'Location': 'Candler'}
{'Title': 'black, pink, and green floral dress', 'Url': 'https://us.letgo.com/en/i/black-pink-and-green-floral-dress_ea495806-20ff-4ee8-accb-d29e437f93af', 'Price': 3, 'Description': '12-18 months ', 'Date': '2018-02-21T23:34:45+00:00', 'Images': ['https://img.letgo.com/images/22/6f/7b/28/226f7b28e93213c9de571da0d58c1483.jpeg'], 'Latitude': 35.513783889312, 'Longitude': -82.686794813796, 'Location': 'Candler'}
{'Title': "girl's black and white Minnie Mouse polka-dot crew-neck dress", 'Url': 'https://us.letgo.com/en/i/girl-s-black-and-white-Minnie-Mouse-polka-dot-crew-neck-dress_c3affc21-ab01-434c-9252-327c77b0f014', 'Price': 4, 'Description': '12 months ', 'Date': '2018-02-21T23:34:10+00:00', 'Images': ['https://img.letgo.com/images/d8/56/92/51/d85692518e3d3e7b7dcb9200688c9ba4.jpeg'], 'Latitude': 35.513783889312, 'Longitude': -82.686794813796, 'Location': 'Candler'}
{'Title': "girl's purple and pink floral spaghetti strap dress", 'Url': 'https://us.letgo.com/en/i/girl-s-purple-and-pink-floral-spaghetti-strap-dress_cada630f-b600-4e6a-be38-9d4f2c9d9407', 'Price': 4, 'Description': '6-12 months ', 'Date': '2018-02-21T23:33:41+00:00', 'Images': ['https://img.letgo.com/images/a9/b2/3c/c1/a9b23cc1dc6de8c5443a163da54b5424.jpeg'], 'Latitude': 35.513783889312, 'Longitude': -82.686794813796, 'Location': 'Candler'}
{'Title': 'copper coil pendant necklace', 'Url': 'https://us.letgo.com/en/i/copper-coil-pendant-necklace_6e56e1f9-986c-4da6-ada0-71bf3a4ea077', 'Price': 65, 'Description': None, 'Date': '2018-02-21T23:33:21+00:00', 'Images': ['https://img.letgo.com/images/56/a5/c6/d0/56a5c6d063879645bdefa40c45a85e4a.jpeg'], 'Latitude': 35.569333, 'Longitude': -82.580862, 'Location': 'Asheville'}
{'Title': 'black and green corded hammer drill', 'Url': 'https://us.letgo.com/en/i/black-and-green-corded-hammer-drill_d6dccdce-99d1-4cbc-be01-31761ecae0e7', 'Price': 499.95, 'Description': None, 'Date': '2018-02-21T23:32:46+00:00', 'Images': ['https://img.letgo.com/images/69/df/c8/9f/69dfc89f00f514ab630646678c5f02fc.jpeg'], 'Latitude': 35.5861382, 'Longitude': -82.5974746, 'Location': 'Asheville'}
{'Title': 'Ihip Bluetooth headphones', 'Url': 'https://us.letgo.com/en/i/Ihip-Bluetooth-headphones_77493587-2400-425b-ab8d-802dec641abf', 'Price': 25, 'Description': 'Their brand new and work great none of that having to plug them into your phone they see completely wireless hust turn on your Bluetooth and listen to music or talk on the phone with the built in speaker and volume control!!\nMeet at Marshall ingles... \nFor more great stuff visit... \n', 'Date': '2018-02-21T23:30:55+00:00', 'Images': ['https://img.letgo.com/images/3d/c1/a8/93/3dc1a8936b2fded2017ef8c93ba31c9a.jpeg'], 'Latitude': 35.820196, 'Longitude': -82.629765, 'Location': 'Marshall'}
{'Title': 'Lot of 2 Pampers size 6', 'Url': 'https://us.letgo.com/en/i/Lot-of-2-Pampers-size-6_a29dcee0-ec88-4a56-8832-b14a2c300ddf', 'Price': 40, 'Description': None, 'Date': '2018-02-21T23:31:32+00:00', 'Images': ['https://img.letgo.com/images/37/31/39/02/37313902874a116c6acdcb1b1ff3a710.jpeg'], 'Latitude': 35.597118, 'Longitude': -82.516648, 'Location': 'Asheville'}
{'Title': 'Vintage candy dish', 'Url': 'https://us.letgo.com/en/i/Vintage-candy-dish_1321bf48-500b-4fcd-9704-e1466e04a51b', 'Price': 20, 'Description': 'Amber tiara pedestal candy dish. Perfect condition.', 'Date': '2018-02-21T23:29:46+00:00', 'Images': ['https://img.letgo.com/images/1c/00/13/03/1c00130383113f1e20cc1d0306b0e452.jpeg'], 'Latitude': 35.4645648, 'Longitude': -83.0014414, 'Location': 'Waynesville'}
{'Title': 'Blue and White Suzuki 400, yr 2005', 'Url': 'https://us.letgo.com/en/i/Blue-and-White-Suzuki-400-yr-2005_62dadb29-ec18-4a5d-baa7-378ce7796822', 'Price': 3700, 'Description': None, 'Date': '2018-02-21T23:29:12+00:00', 'Images': ['https://img.letgo.com/images/aa/71/34/27/aa713427b1e8af67f276febb5f1ae17a.jpeg'], 'Latitude': 35.4671172, 'Longitude': -83.0026703, 'Location': 'Waynesville'}
{'Title': 'Handmade Hemp Bracelets & Key chains', 'Url': 'https://us.letgo.com/en/i/Handmade-Hemp-Bracelets-Key-chains_d374e086-729c-4240-8e99-2699c3275ec3', 'Price': 6, 'Description': None, 'Date': '2018-02-21T23:27:42+00:00', 'Images': ['https://img.letgo.com/images/0d/32/ea/27/0d32ea2715095357e9cda3cda6598415.jpeg'], 'Latitude': 35.4833764, 'Longitude': -82.4578764, 'Location': 'Fletcher'}
{'Title': 'Handmade Hemp Necklaces', 'Url': 'https://us.letgo.com/en/i/Handmade-Hemp-Necklaces_d3c22d76-4d4d-43f7-a613-ef4d5a4e53bd', 'Price': 8, 'Description': None, 'Date': '2018-02-21T23:25:58+00:00', 'Images': ['https://img.letgo.com/images/b6/e0/8d/0a/b6e08d0a79f57215f5fc5417451fbd04.jpeg'], 'Latitude': 35.4833764, 'Longitude': -82.4578764, 'Location': 'Fletcher'}
{'Title': 'Luvs and Huggies disposable diaper packs', 'Url': 'https://us.letgo.com/en/i/Luvs-and-Huggies-disposable-diaper-packs_75204ed1-ed11-484e-81e6-cc923b923292', 'Price': 13, 'Description': None, 'Date': '2018-02-21T23:23:55+00:00', 'Images': ['https://img.letgo.com/images/3a/ce/16/a1/3ace16a18b398de6e0c8d4b56d1fa8c9.jpeg'], 'Latitude': 35.597118, 'Longitude': -82.516648, 'Location': 'Asheville'}
2018-02-22 00:46:24 [scrapy.core.engine] INFO: Closing spider (finished)
2018-02-22 00:46:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1977,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 7625,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 4,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 2, 21, 23, 46, 24, 468717),
 'item_scraped_count': 20,
 'log_count/INFO': 7,
 'memusage/max': 50208768,
 'memusage/startup': 50208768,
 'request_depth_max': 3,
 'response_received_count': 4,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'start_time': datetime.datetime(2018, 2, 21, 23, 46, 23, 770175)}
2018-02-22 00:46:24 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0

Upvotes: 1

Related Questions