robots.txt

Reputation: 137

Script throws some error at some point within the execution

I've created a script in Python using pyppeteer to collect the links of different posts from a webpage and then parse the title of each post by going to their target pages, reusing those collected links. Although the content is static, I'd like to know how pyppeteer works in such cases.

I tried to supply the browser variable from the main() function to the fetch() and browse_all_links() functions so that I can reuse the same browser over and over again.

My current approach:

import asyncio
from pyppeteer import launch

url = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(page,url):
    await page.goto(url)
    linkstorage = []
    await page.waitForSelector('.summary .question-hyperlink')
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    return linkstorage

async def browse_all_links(page,link):
    await page.goto(link)
    await page.waitForSelector('h1 > a')
    title = await page.querySelectorEval('h1 > a','(e => e.innerText)')
    print(title)

async def main():
    browser = await launch(headless=False,autoClose=False)
    [page] = await browser.pages()
    links = await fetch(page,url)
    tasks = [await browse_all_links(page,url) for url in links]
    await asyncio.gather(*tasks)

if __name__ == '__main__':
    asyncio.run(main())

The above script fetches some titles but spits out the following error at some point during the execution:

Possible to select <a> with specific text within the quotes?
Crawler Runs Too Slow
How do I loop a list of ticker to scrape balance sheet info?
How to retrive the url of searched video from youtbe using python
VBA-JSON to import data from all pages in one table
Is there an algorithm that detects semantic visual blocks in a webpage?
find_all only scrape the last value

#ERROR STARTS

Future exception was never retrieved
future: <Future finished exception=NetworkError('Protocol error (Runtime.releaseObject): Cannot find context with specified id')>
pyppeteer.errors.NetworkError: Protocol error (Runtime.releaseObject): Cannot find context with specified id
Future exception was never retrieved

Upvotes: 2

Views: 2403

Answers (1)

TwinckleTwinckle

Reputation: 309

As it's been two days since this question was posted and no one has answered yet, I will take this opportunity to address the issue in a way I think might be helpful to you.

  • There are 15 links but you are getting only 7; this is probably because websockets is losing the connection and the page is no longer reachable.

  • List comprehension

tasks = [await browse_all_links(page,url) for url in links] — what do you expect this list to be? If it succeeds, it will be a list of None elements, so your next line of code will throw an error!

  • Solution

    downgrade websockets 7.0 to websockets 6.0

    remove this line of code await asyncio.gather(*tasks)

    I am using python 3.6, so I had to change the last line of code. You don't need to change it if you are using python 3.7, which I think you are.
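To illustrate the list-comprehension point above, here is a minimal sketch with a stand-in coroutine (work() is hypothetical, not part of the scraper): awaiting inside the comprehension runs everything sequentially and collects the return values (a list of None), which is why passing that list to asyncio.gather() later fails.

```python
import asyncio

async def work(n):
    # stand-in for browse_all_links: does something, returns None
    await asyncio.sleep(0)

async def main():
    # awaiting inside the comprehension runs each call sequentially
    # and collects the return values -- a list of None:
    results = [await work(n) for n in range(3)]
    print(results)  # [None, None, None]

    # asyncio.gather(*results) would then fail, because None is not awaitable.
    # To actually gather, collect the *coroutine objects* instead:
    tasks = [work(n) for n in range(3)]
    await asyncio.gather(*tasks)

asyncio.run(main())
```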

import asyncio
from pyppeteer import launch

url = "https://stackoverflow.com/questions/tagged/web-scraping"

async def fetch(page,url):
    await page.goto(url)
    linkstorage = []
    await page.waitForSelector('.summary .question-hyperlink')
    elements = await page.querySelectorAll('.summary .question-hyperlink')
    for element in elements:
        linkstorage.append(await page.evaluate('(element) => element.href', element))
    return linkstorage

async def browse_all_links(page,link):
    await page.goto(link)
    await page.waitForSelector('h1 > a')
    title = await page.querySelectorEval('h1 > a','(e => e.innerText)')
    print(title)

async def main():
    browser = await launch(headless=False,autoClose=False)
    [page] = await browser.pages()
    links = await fetch(page,url)
    tasks = [await browse_all_links(page,url) for url in links]
    #await asyncio.gather(*tasks)
    await browser.close()

if __name__ == '__main__':
    #asyncio.run(main())
    asyncio.get_event_loop().run_until_complete(main())
  • Output

(testenv) C:\Py\pypuppeteer1>python stack3.py
Scrapy Shell response.css returns an empty array
Scrapy real-time spider
Why do I get KeyError while reading data with get request?
Scrapy spider can't redefine custom_settings according to args
Custom JS Script using Lua in Splash UI
Can someone explain why and how this piece of code works [on hold]
How can I extract required data from a list of strings?
Scrapy CrawlSpider rules for crawling single page
how to scrape a web-page with search bar results, when the search query does not appear in the url
Nested for loop keeps repeating
Get all tags except a list of tags BeautifulSoup
Get current URL using Python and webbot
How to login to site and send data
Unable to append value to colums. Getting error IndexError: list index out of range
NextSibling.Innertext not working. "Object doesn't support this property"
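As an aside: if you do want true concurrency rather than the sequential loop above, the usual pattern is one page (or task) per link, bounded by a semaphore so you don't open too many at once. This is only a sketch of the pattern — browse() and the links are stand-ins, since the real pyppeteer calls need a running browser:

```python
import asyncio

async def browse(sem, link):
    # stand-in for: page = await browser.newPage(); await page.goto(link); ...
    async with sem:
        await asyncio.sleep(0)          # simulate the network round-trip
        return f"title of {link}"

async def main():
    links = [f"https://example.com/q/{i}" for i in range(5)]
    sem = asyncio.Semaphore(2)          # at most 2 pages in flight at a time
    # note: no await inside the generator -- gather receives coroutines
    titles = await asyncio.gather(*(browse(sem, link) for link in links))
    print(titles)

asyncio.run(main())
```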

Upvotes: 2
