Reputation: 25
I am trying to scrape new posts from Pastebin using their API. It works pretty well, but I keep getting duplicate posts. Right now I compare two lists to tell which entries haven't changed, but that makes the output alternate between posts. How do I fix my method for comparing lists so I get the most recent pastes without alternating repeats? Here is my current code:
old_response = []
while True:
    try:
        response = s.get("http://scrape.pastebin.com/api_scraping.php?limit=5").json()
        for x in old_response:
            response.remove(x)
        for i in range(len(response)):
            print(i)
            time.sleep(2.5)
            logger.info("Posted Link")
            #thread = threading.Thread(target=communicate,args=(response, i))
            #thread.start()
            #thread.join()
        old_response = response[:]
    except Exception as e:
        logger.critical(f"ERROR: {e}")
        pass
Also, since the API is private, I'll just show what a simple response looks like. Let's say you scrape 2 results. It will return the two latest results as something like this:
[
    {
        "scrape_url": "https://scrape.pastebin.com/api_scrape_item.php?i=J2CeszTZ",
        "full_url": "https://pastebin.com/J2CeszTZ",
        "date": "1585606093",
        "key": "J2CeszTZ",
        "size": "98",
        "expire": "0",
        "title": "",
        "syntax": "text",
        "user": "irismar"
    },
    {
        "scrape_url": "https://scrape.pastebin.com/api_scrape_item.php?i=hYJ7Xcmm",
        "full_url": "https://pastebin.com/hYJ7Xcmm",
        "date": "1585606099",
        "key": "hYJ7Xcmm",
        "size": "1371",
        "expire": "0",
        "title": "",
        "syntax": "php",
        "user": ""
    }
]
If we then refresh our URL (http://scrape.pastebin.com/api_scraping.php?limit=2), it will again return the two latest results:
[
    {
        "scrape_url": "https://scrape.pastebin.com/api_scrape_item.php?i=ZcMJxCwc",
        "full_url": "https://pastebin.com/ZcMJxCwc",
        "date": "1585606208",
        "key": "ZcMJxCwc",
        "size": "266166",
        "expire": "1585606808",
        "title": "OpenEuCalendar",
        "syntax": "text",
        "user": "scholzsebastian"
    },
    {
        "scrape_url": "https://scrape.pastebin.com/api_scrape_item.php?i=qY5VdbSk",
        "full_url": "https://pastebin.com/qY5VdbSk",
        "date": "1585606143",
        "key": "qY5VdbSk",
        "size": "25",
        "expire": "0",
        "title": "Online jobs",
        "syntax": "text",
        "user": ""
    }
]
When I work with larger data sets, it often alternates posts. I am trying to detect only new posts and not save repeated pastes. Any help would be appreciated.
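To show what I mean by alternating, here is a minimal sketch (made-up data, no network) of my comparison logic: because `old_response` is overwritten with the already-stripped list, a round with no new pastes leaves it empty, and the round after that reports everything again.

```python
# Minimal reproduction of the alternation, with made-up pastes.
def diff_round(old_response, response):
    # Same idea as my loop: drop everything we saw last round.
    for x in old_response:
        if x in response:
            response.remove(x)
    return response  # this list also becomes the next old_response

a, b = {"key": "a"}, {"key": "b"}

old_response = diff_round([], [a, b])            # round 1: a, b reported
old_response = diff_round(old_response, [a, b])  # round 2: nothing new -> old_response is now []
old_response = diff_round(old_response, [a, b])  # round 3: a and b are reported AGAIN
print(old_response)  # [{'key': 'a'}, {'key': 'b'}]
```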
Upvotes: 0
Views: 123
Reputation: 338386
I would set up a dictionary to collect keys and dates of pastes. When the server returns an item we already know (same key and date), we skip it.
This works best when we set up the whole thing as a generator:
import time
import json
import logging

import requests

def scraper():
    seen_items = {}
    api_url = "http://scrape.pastebin.com/api_scraping.php"

    while True:
        try:
            response = requests.get(api_url, {'limit': 5})

            for item in response.json():
                last_known_date = seen_items.get(item['key'])
                if item['date'] != last_known_date:
                    seen_items[item['key']] = item['date']
                    yield item

            time.sleep(2.5)
        except json.JSONDecodeError:
            logging.error(f"Server response: {response.text}")
            return
Now we can iterate the items as if they were a list:
for item in scraper():
print(item)
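One caveat: `seen_items` lives only in memory, so restarting the script reports every currently listed paste again. A minimal sketch of persisting it between runs (the file name and JSON format are my assumptions):

```python
import json
import os

SEEN_FILE = "seen_items.json"  # assumed file name

def load_seen():
    # Restore the key -> date mapping saved by a previous run, if any.
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE) as f:
            return json.load(f)
    return {}

def save_seen(seen_items):
    # Write the mapping back, e.g. after each batch or on shutdown.
    with open(SEEN_FILE, "w") as f:
        json.dump(seen_items, f)
```

Inside scraper(), `seen_items = load_seen()` at the top plus a `save_seen(seen_items)` after each batch would carry the state across restarts.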
Todo:
- More specific error handling than except Exception, that's too generic.
- A smarter way of waiting than a fixed time.sleep(2.5).
- Persisting the dedup state, e.g. by moving seen_items out of the function and storing it somewhere.
Upvotes: 1
Reputation: 3593
Instead of removing from the current response, I would add to the old list whenever a new item shows up in response. Something like:
old_response = []
while True:
    try:
        response = s.get("http://scrape.pastebin.com/api_scraping.php?limit=5").json()
        for record in response:
            if record in old_response:
                # we have seen it already, skip it
                continue
            # We haven't seen it, so let's add it
            old_response.append(record)
        for i in range(len(response)):
            print(i)
            time.sleep(2.5)
            logger.info("Posted Link")
            #thread = threading.Thread(target=communicate,args=(response, i))
            #thread.start()
            #thread.join()
        # This should not be needed anymore
        # old_response = response[:]
    except Exception as e:
        logger.critical(f"ERROR: {e}")
        pass
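A side note on this approach: old_response grows without bound, and `record in old_response` is a linear scan over whole dicts. A set of paste keys does the same check in constant time; a sketch assuming the response shape from the question:

```python
seen_keys = set()

def new_records(response):
    # Return only pastes whose "key" we have not seen before.
    fresh = []
    for record in response:
        if record["key"] in seen_keys:
            continue  # already reported, skip it
        seen_keys.add(record["key"])
        fresh.append(record)
    return fresh
```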
Upvotes: 1