Jimmy Hankey

Reputation: 25

Webscraping JSON

I am trying to scrape new posts from Pastebin using their API. It's working pretty well, however I keep getting duplicate posts. Right now I am trying to compare two lists to tell which entries haven't changed, but this causes posts to alternate. How do I fix my method for comparing lists so I can get the most recent pastes without getting alternating repeats? Here is my current code.


import time
import logging
import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
s = requests.Session()

old_response = []
while True:
    try:
        response = s.get("http://scrape.pastebin.com/api_scraping.php?limit=5").json()

        # remove every paste we already saw in the previous response
        for x in old_response:
            response.remove(x)

        for i in range(len(response)):
            print(i)
            time.sleep(2.5)
            logger.info("Posted Link")
            #thread = threading.Thread(target=communicate,args=(response, i))
            #thread.start()
            #thread.join()
        old_response = response[:]
    except Exception as e:
        logger.critical(f"ERROR: {e}")

Also, since the API is private, I'll just show what a simple response would look like. Let's say you scrape 2 results; it will return the two latest pastes as something like this:

[
    {
        "scrape_url": "https://scrape.pastebin.com/api_scrape_item.php?i=J2CeszTZ",
        "full_url": "https://pastebin.com/J2CeszTZ",
        "date": "1585606093",
        "key": "J2CeszTZ",
        "size": "98",
        "expire": "0",
        "title": "",
        "syntax": "text",
        "user": "irismar"
    },
    {
        "scrape_url": "https://scrape.pastebin.com/api_scrape_item.php?i=hYJ7Xcmm",
        "full_url": "https://pastebin.com/hYJ7Xcmm",
        "date": "1585606099",
        "key": "hYJ7Xcmm",
        "size": "1371",
        "expire": "0",
        "title": "",
        "syntax": "php",
        "user": ""
    }
]


If we then refresh our URL (http://scrape.pastebin.com/api_scraping.php?limit=2), it again returns the two latest results:


[
    {
        "scrape_url": "https://scrape.pastebin.com/api_scrape_item.php?i=ZcMJxCwc",
        "full_url": "https://pastebin.com/ZcMJxCwc",
        "date": "1585606208",
        "key": "ZcMJxCwc",
        "size": "266166",
        "expire": "1585606808",
        "title": "OpenEuCalendar",
        "syntax": "text",
        "user": "scholzsebastian"
    },
    {
        "scrape_url": "https://scrape.pastebin.com/api_scrape_item.php?i=qY5VdbSk",
        "full_url": "https://pastebin.com/qY5VdbSk",
        "date": "1585606143",
        "key": "qY5VdbSk",
        "size": "25",
        "expire": "0",
        "title": "Online jobs",
        "syntax": "text",
        "user": ""
    }
]

When I work with a lot of data sets, it often alternates posts. I'm trying to detect only new posts and not save repeated pastes. Any help would be appreciated.
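To make the overlap concrete, what I am after is something that diffs consecutive responses by the unique key field, so only genuinely new pastes survive. A rough sketch (reusing the names from my code above):

old_keys = {item['key'] for item in old_response}

# keep only pastes whose key did not appear in the previous response
new_items = [item for item in response if item['key'] not in old_keys]

for item in new_items:
    print(item['full_url'])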

Upvotes: 0

Views: 123

Answers (2)

Tomalak

Reputation: 338386

I would set up a dictionary to collect keys and dates of pastes. When the server returns an item we already know (same key and date), we skip it.

This works best when we set up the whole thing as a generator:

import time
import json
import requests
import logging

def scraper():
    seen_items = {}  # maps paste key -> date of the version we last yielded
    api_url = "http://scrape.pastebin.com/api_scraping.php"

    while True:
        try:
            response = requests.get(api_url, {'limit': 5})
            for item in response.json():
                # yield only items whose key is new or whose date has changed
                last_known_date = seen_items.get(item['key'])
                if item['date'] != last_known_date:
                    seen_items[item['key']] = item['date']
                    yield item
            time.sleep(2.5)
        except json.JSONDecodeError:
            logging.error(f"Server response: {response.text}")
            return

Now we can iterate the items as if they were a list:

for item in scraper():
    print(item)

Todo

  • Add other error handlers individually. Avoid except Exception, that's too generic.
  • Add a smarter timing mechanism than time.sleep(2.5).
  • Maybe add persistence by moving seen_items out of the function and storing it somewhere; a sketch follows below.
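For the last point, one simple possibility is to load and save seen_items as a JSON file around the loop. A minimal sketch, assuming a hypothetical seen_items.json file (any storage would do):

import json
import os

SEEN_FILE = "seen_items.json"  # hypothetical filename, just for illustration

def load_seen_items():
    # restore the key -> date mapping from a previous run, if any
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE) as f:
            return json.load(f)
    return {}

def save_seen_items(seen_items):
    # persist the mapping so a restart does not re-yield known pastes
    with open(SEEN_FILE, "w") as f:
        json.dump(seen_items, f)

scraper() would then take seen_items as a parameter and call save_seen_items(seen_items) after each batch.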

Upvotes: 1

sal

Reputation: 3593

Instead of removing from the current list, I would add to the old list whenever a new item shows up in response. Something like:

old_response = []
while True:
    try:
        response = s.get("http://scrape.pastebin.com/api_scraping.php?limit=5").json()

        new_records = []
        for record in response:
            if record in old_response:
               # we have seen it already, skip it then
               continue

            # We haven't seen it, so remember it and queue it for processing
            old_response.append(record)
            new_records.append(record)

        # Process only the records we haven't seen before
        for record in new_records:
            print(record)
            time.sleep(2.5)
            logger.info("Posted Link")
            #thread = threading.Thread(target=communicate,args=(record,))
            #thread.start()
            #thread.join()

        # This should not be needed anymore
        # old_response = response[:]

    except Exception as e:
        logger.critical(f"ERROR: {e}")
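One caveat: old_response grows without bound, and record in old_response is a linear scan over whole dicts. If that becomes an issue, a set of paste keys gives the same membership test in constant time; a sketch of just the dedup part:

seen_keys = set()

for record in response:
    if record['key'] in seen_keys:
        # already handled this paste, skip it
        continue
    seen_keys.add(record['key'])
    # ...process record here...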

Upvotes: 1
