BrianEbrahimi

Reputation: 67

Scrapy Trying to get Json Response

I am using a scraper to scrape the Steam gaming platform, and I am having trouble with pagination. The comments at this link: https://steamcommunity.com/sharedfiles/filedetails/comments/2460661464 are paginated, and I believe the page fetches each page by making a POST request to some server. I would like to simulate this request using Scrapy's FormRequest and get all of the comments at once, but I don't know how. What should my headers and formdata look like? Currently they look like this:

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'Host': 'steamcommunity.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0'
}

data = {
    "start": "0",
    "totalcount": comment_number,
    "count": comment_number,
    "sessionid": "d880ab2338b70926db0a9591",
    "extended_data": f"{{\"contributors\":[\"{contributor_id}\",{{}}],\"appid\":289070,\"sharedfile\":{{\"m_parentsDetails\":null,\"m_parentBundlesDetails\":null,\"m_bundledChildren\":[],\"m_ownedBundledItems\":[]}},\"parent_item_reported\":false}}",
    "feature2": "-1"
}

yield FormRequest(url, formdata=data, headers=headers, callback=self.parse_paginated_comments, dont_filter=True, meta={'app_id': app_id, 'game': game, 'workshop_id': workshop_id, 'workshop_name': workshop_name})


What are the correct headers/data and how do I set up my FormRequest to get all of the comments (in this case 1-134)?

Upvotes: 1

Views: 124

Answers (1)

Paul M.

Reputation: 10799

I don't know anything about Scrapy, but here's how you could do it using just basic requests and BeautifulSoup.

The API doesn't seem to be very strict about the POSTed payload; it doesn't mind if some parameters are omitted. I've also found that you can assign an impossibly large number to the count parameter to have the API return all comments in one response (assuming there will never be more than 99999999 comments in a thread, in this case). I haven't played around with the request headers that much - you could probably trim them down even further.

import sys

import requests
from bs4 import BeautifulSoup as Soup


def get_comments(thread_id):
    # The comment "render" endpoint the page itself calls when you click through pages.
    url = "https://steamcommunity.com/comment/PublishedFile_Public/render/76561198401810552/{}/".format(thread_id)

    headers = {
        "Accept": "text/javascript, text/html, application/xml, text/xml, */*",
        "Accept-Encoding": "gzip, deflate",
        "Content-type": "application/x-www-form-urlencoded; charset=UTF-8",
        "User-Agent": "Mozilla/5.0",
        "X-Requested-With": "XMLHttpRequest"
    }

    # An oversized count makes the endpoint return every comment in a single response.
    payload = {
        "start": "0",
        "count": "99999999",
    }

    def to_clean_comment(element):
        return element.text.strip()

    response = requests.post(url, headers=headers, data=payload)
    response.raise_for_status()

    # The JSON response embeds the rendered comment markup under "comments_html".
    soup = Soup(response.json()["comments_html"], "html.parser")
    yield from map(to_clean_comment, soup.select("div.commentthread_comment_text"))


def main():
    for comment in get_comments("2460661464"):
        print(comment)
    return 0


if __name__ == "__main__":
    sys.exit(main())
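
If you do want to keep everything inside Scrapy, the same request should translate to a FormRequest along the lines of the sketch below. I haven't run this as part of a spider, so treat the spider name, callback name, and yielded item shape as placeholder assumptions; the endpoint, headers, and payload are the same ones used above.

import json

import scrapy
from scrapy import FormRequest


class SteamCommentsSpider(scrapy.Spider):
    name = "steam_comments"  # placeholder name

    def start_requests(self):
        thread_id = "2460661464"
        # Same "render" endpoint the requests example POSTs to.
        url = ("https://steamcommunity.com/comment/PublishedFile_Public/"
               "render/76561198401810552/{}/".format(thread_id))
        yield FormRequest(
            url,
            formdata={"start": "0", "count": "99999999"},
            headers={
                "Accept": "text/javascript, text/html, application/xml, text/xml, */*",
                "X-Requested-With": "XMLHttpRequest",
                "User-Agent": "Mozilla/5.0",
            },
            callback=self.parse_paginated_comments,
            dont_filter=True,
        )

    def parse_paginated_comments(self, response):
        # The endpoint answers with JSON; the rendered markup sits in "comments_html".
        html = json.loads(response.text)["comments_html"]
        for div in scrapy.Selector(text=html).css("div.commentthread_comment_text"):
            text = div.xpath("string(.)").get().strip()
            if text:
                yield {"comment": text}

Saved as a standalone file, this should run with scrapy runspider, and you can carry your extra context (app_id, game, workshop_id, workshop_name) through meta or cb_kwargs exactly as in your original request.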

Upvotes: 2
