robots.txt

Reputation: 137

Script throws some error at some point within the execution

I've created a script in Python using pyppeteer to collect the links of different posts from a webpage and then parse the title of each post by going to their target pages, reusing those collected links. Although the content is static, I'd like to know how pyppeteer works in such cases.

I tried to supply the browser variable from the main() function to the fetch() and browse_all_links() functions so that I can reuse the same browser over and over again.

My current approach:

import asyncio
from pyppeteer import launch

url = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(page,url):
    await page.goto(url)
    linkstorage = []
    await page.waitForSelector('.summary .question-hyperlink')
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    return linkstorage

async def browse_all_links(page,link):
    await page.goto(link)
    await page.waitForSelector('h1 > a')
    title = await page.querySelectorEval('h1 > a','(e => e.innerText)')
    print(title)

async def main():
    browser = await launch(headless=False,autoClose=False)
    [page] = await browser.pages()
    links = await fetch(page,url)
    tasks = [await browse_all_links(page,url) for url in links]
    await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(main())

The above script fetches some titles but spits out the following error at some point during the execution:

Possible to select <a> with specific text within the quotes?
Crawler Runs Too Slow
How do I loop a list of ticker to scrape balance sheet info?
How to retrive the url of searched video from youtbe using python
VBA-JSON to import data from all pages in one table
Is there an algorithm that detects semantic visual blocks in a webpage?
find_all only scrape the last value

#ERROR STARTS

Future exception was never retrieved
future: <Future finished exception=NetworkError('Protocol error (Runtime.releaseObject): Cannot find context with specified id')>
pyppeteer.errors.NetworkError: Protocol error (Runtime.releaseObject): Cannot find context with specified id
Future exception was never retrieved

Upvotes: 2

Views: 2403

Answers (1)

TwinckleTwinckle

Reputation: 309

As it's been two days since this question was posted and no one has answered yet, I will take this opportunity to address the issue in a way I think might be helpful to you.

  • There are 15 links but you are getting only 7; this is probably because websockets is losing the connection and the page is no longer reachable.

  • List comprehension

tasks = [await browse_all_links(page,url) for url in links] — what do you expect this list to be? If it succeeds, it will be a list of None elements, so your next line of code will throw an error!

  • Solution

    downgrade websockets 7.0 to websockets 6.0

    remove this line of code await asyncio.gather(*tasks)

    I am using python 3.6, so I had to change the last line of code. You don't need to change it if you are using python 3.7, which I think you are.
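To illustrate the list-comprehension point above, here is a minimal sketch with a stand-in coroutine (work() is hypothetical, not part of the scraper): awaiting inside the comprehension runs everything sequentially and collects the return values (a list of None), which is why passing that list to asyncio.gather() later fails.

```python
import asyncio

async def work(n):
    # stand-in for browse_all_links: does something, returns None
    await asyncio.sleep(0)

async def main():
    # awaiting inside the comprehension runs each call sequentially
    # and collects the return values -- a list of None:
    results = [await work(n) for n in range(3)]
    print(results)  # [None, None, None]

    # asyncio.gather(*results) would then fail, because None is not awaitable.
    # To actually gather, collect the *coroutine objects* instead:
    tasks = [work(n) for n in range(3)]
    await asyncio.gather(*tasks)

asyncio.run(main())
```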

import asyncio
from pyppeteer import launch

url = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(page,url):
    await page.goto(url)
    linkstorage = []
    await page.waitForSelector('.summary .question-hyperlink')
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    return linkstorage

async def browse_all_links(page,link):
    await page.goto(link)
    await page.waitForSelector('h1 > a')
    title = await page.querySelectorEval('h1 > a','(e => e.innerText)')
    print(title)

async def main():
    browser = await launch(headless=False,autoClose=False)
    [page] = await browser.pages()
    links = await fetch(page,url)
    tasks = [await browse_all_links(page,url) for url in links]
    #await asyncio.gather(*tasks)
    await browser.close()

if __name__ == '__main__':
    #asyncio.run(main())
    asyncio.get_event_loop().run_until_complete(main())
  • Output

(testenv) C:\Py\pypuppeteer1>python stack3.py
Scrapy Shell response.css returns an empty array
Scrapy real-time spider
Why do I get KeyError while reading data with get request?
Scrapy spider can't redefine custom_settings according to args
Custom JS Script using Lua in Splash UI
Can someone explain why and how this piece of code works [on hold]
How can I extract required data from a list of strings?
Scrapy CrawlSpider rules for crawling single page
how to scrape a web-page with search bar results, when the search query does not appear in the url
Nested for loop keeps repeating
Get all tags except a list of tags BeautifulSoup
Get current URL using Python and webbot
How to login to site and send data
Unable to append value to colums. Getting error IndexError: list index out of range
NextSibling.Innertext not working. "Object doesn't support this property"
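As an aside: if you do want true concurrency rather than the sequential loop above, the usual pattern is one page (or task) per link, bounded by a semaphore so you don't open too many at once. This is only a sketch of the pattern — browse() and the links are stand-ins, since the real pyppeteer calls need a running browser:

```python
import asyncio

async def browse(sem, link):
    # stand-in for: page = await browser.newPage(); await page.goto(link); ...
    async with sem:
        await asyncio.sleep(0)          # simulate the network round-trip
        return f"title of {link}"

async def main():
    links = [f"https://example.com/q/{i}" for i in range(5)]
    sem = asyncio.Semaphore(2)          # at most 2 pages in flight at a time
    # note: no await inside the generator -- gather receives coroutines
    titles = await asyncio.gather(*(browse(sem, link) for link in links))
    print(titles)

asyncio.run(main())
```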

Upvotes: 2
