blablaalb

Reputation: 182

Sending POST requests with Scrapy

I'm learning web scraping with Scrapy and I'm having problems with dynamically loaded content. I'm trying to scrape a phone number from a website that sends a POST request to obtain the number. These are the headers of the POST request it sends:

Host: www.mymarket.ge
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept: */*
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Referer: https://www.mymarket.ge/en/pr/16399126/savaWro-inventari/fulis-yuTi
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
X-Requested-With: XMLHttpRequest
Content-Length: 13
Origin: https://www.mymarket.ge
Connection: keep-alive
Cookie: Lang=en; split_test_version=v1; CookieID=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJEYXRhIjp7IklEIjozOTUwMDY2MzUsImN0IjoxNTkyMzA2NDMxfSwiVG9rZW5JRCI6Ik55empxVStDa21QT1hKaU9lWE56emRzNHNSNWtcL1wvaVVUYjh2dExCT3ZKWT0iLCJJc3N1ZWRBdCI6MTU5MjMyMTc1MiwiRXhwaXJlc0F0IjoxNTkyMzIyMDUyfQ.mYR-I_51WLQbzWi-EH35s30soqoSDNIoOyXgGQ4Eu84; ka=da; SHOW_BETA_POPUP=B; APP_VERSION=B; LastSearch=%7B%22CatID%22%3A%22515%22%7D; PHPSESSID=eihhfcv85liiu3kt55nr9fhu5b; PopUpLog=%7B%22%2A%22%3A%222020-05-07+15%3A13%3A29%22%7D

and this is the body:

PrID=16399126

I successfully managed to replicate the POST request on reqbin.com, but I can't figure out how to do it with Scrapy. This is what my code looks like:

class MymarketcrawlerSpider(CrawlSpider):
    name = "mymarketcrawler"
    allowed_domains = ["mymarket.ge"]
    start_urls = ["http://mymarket.ge/"]

    rules = (
        Rule(
            LinkExtractor(allow=r".*mymarket.ge/ka/*", restrict_css=".product-card"),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        item_loader = ItemLoader(item=MymarketItem(), response=response)

        def parse_num(response):
            try:
                response_text = response.text
                response_dict = ast.literal_eval(response_text)
                number = response_dict['Data']['Data']['numberToShow']
                nonlocal item_loader
                item_loader.add_value("number", number)
                yield item_loader.load_item()
            except Exception as e:
                raise CloseSpider(e)


        yield FormRequest.from_response(
            response,
            url=r"https://www.mymarket.ge/ka/pr/ShowFullNumber/",
            headers={
                "Host": "www.mymarket.ge",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
                "Accept": "*/*",
                "Accept-Language": "en-US,en;q=0.5",
                "Accept-Encoding": "gzip, deflate, br",
                "Referer": "https://www.mymarket.ge/ka/pr/16399126/savaWro-inventari/fulis-yuTi",
                "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
                "X-Requested-With": "XMLHttpRequest",
            },
            formdata={"PrID": "16399126"},
            method="POST",
            dont_filter=True,
            callback=parse_num
        )
        item_loader.add_xpath(
            "seller", "//div[@class='d-flex user-profile']/div/span/text()"
        )
        item_loader.add_xpath(
            "product",
            "//div[contains(@class, 'container product')]//h1[contains(@class, 'product-title')]/text()",
        )
        item_loader.add_xpath(
            "price",
            "//div[contains(@class, 'container product')]//span[contains(@class, 'product-price')][1]/text()",
            TakeFirst(),
        )
        item_loader.add_xpath(
            "images",
            "//div[@class='position-sticky']/ul[@id='imageGallery']/li/@data-src",
        )
        item_loader.add_xpath(
            "condition", "//div[contains(@class, 'condition-label')]/text()"
        )
        item_loader.add_xpath(
            "city",
            "//div[@class='d-flex font-14 font-weight-medium location-views']/span[contains(@class, 'location')]/text()",
        )
        item_loader.add_xpath(
            "number_of_views",
            "//div[@class='d-flex font-14 font-weight-medium location-views']/span[contains(@class, 'svg-18')]/span/text()",
        )
        item_loader.add_xpath(
            "publish_date",
            "//div[@class='d-flex left-side']//div[contains(@class, 'font-12')]/span[2]/text()",
        )
        item_loader.add_xpath(
            "total_products_amount",
            "//div[contains(@class, 'user-profile')]/div/a/text()",
            re=r"\d+",
        )
        item_loader.add_xpath(
            "description", "//div[contains(@class, 'texts full')]/p/text()"
        )
        item_loader.add_value("url", response.url)
        yield item_loader.load_item()

The code above doesn't work: the number field is not populated. I can print the number to the screen, but I'm unable to save it to the CSV file; the number column in the CSV file is blank.

Upvotes: 0

Views: 78

Answers (1)

borisdonchev

Reputation: 1224

Scrapy works asynchronously: every link to crawl, every item to process, etc. is put into a queue. That is why you yield a request and wait for the downloader, the item pipelines, etc. to process it.

What is happening is that your requests are processed separately, which is why you don't see your results together. Personally, I would parse the results from the first response, save them in the request's meta, and pass them on to the next request so that the data is available afterwards.

E.g.

class MymarketcrawlerSpider(CrawlSpider):
    name = "mymarketcrawler"
    allowed_domains = ["mymarket.ge"]
    start_urls = ["http://mymarket.ge/"]

    rules = (
        Rule(
            LinkExtractor(allow=r".*mymarket.ge/ka/*", restrict_css=".product-card"),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):

        def parse_num(response):
            item_loader = ItemLoader(item=MymarketItem(), response=response)
            try:
                response_text = response.text
                response_dict = ast.literal_eval(response_text)
                number = response_dict['Data']['Data']['numberToShow']
                # New part: 
                product = response.meta['product']             

                # You won't need this now: nonlocal item_loader
                # Also new: 
                item_loader.add_value("number", number)

                item_loader.add_value("product", product)
                yield item_loader.load_item()
            except Exception as e:
                raise CloseSpider(e)
        # Rewrite your parsers like this: 
        product = response.xpath(
            "//div[contains(@class, 'container product')]//h1[contains(@class, 'product-title')]/text()"
        ).get()

        yield FormRequest.from_response(
            response,
            url=r"https://www.mymarket.ge/ka/pr/ShowFullNumber/",
            headers={
                "Host": "www.mymarket.ge",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
                "Accept": "*/*",
                "Accept-Language": "en-US,en;q=0.5",
                "Accept-Encoding": "gzip, deflate, br",
                "Referer": "https://www.mymarket.ge/ka/pr/16399126/savaWro-inventari/fulis-yuTi",
                "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
                "X-Requested-With": "XMLHttpRequest",
            },
            formdata={"PrID": "16399126"},
            method="POST",
            dont_filter=True,
            callback=parse_num,
            meta={"product": product}
        )

Upvotes: 1
