Pavel Pereverzev

Reputation: 499

How to get all pages from the whole website using python?

I am trying to make a tool that should get every link from a website. For example, I need to get all question pages from Stack Overflow. I tried using scrapy:

from scrapy.spiders import CrawlSpider
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['https://stackoverflow.com/questions/']

    def parse(self, response):
        le = LinkExtractor()
        for link in le.extract_links(response):
            url_lnk = link.url
            print (url_lnk)

Here I got only the questions from the start page. What do I need to do to get all 'question' links? Time doesn't matter, I just need to understand what to do.

UPD

The site which I want to observe is https://sevastopol.su/ - this is a local city news website.

The list of all news should be contained here: https://sevastopol.su/all-news

At the bottom of this page you can see page numbers, but if we go to the last page of news we will see that it is number 765 (right now, 19.06.2019), yet it shows the last news item with a date of 19 June 2018. So the pagination only goes back about one year. But there are plenty of older news links (probably going back to 2010) that are still alive and can even be found through the site's search page. That is why I wanted to know whether there is access to some global link store of this site.

Upvotes: 3

Views: 9106

Answers (4)

GUNnibal

Reputation: 1

Preface

While the original question was narrowed down in the details, the wording of the question itself is fairly broad. Because of this I expect multiple people to stumble upon this particular question when solving broader problems and leave unsatisfied. So I want to try and provide an answer that could cover more cases. Some important things to note:

  • This solution assumes that the website you wish to crawl follows a couple of conventions, which it technically doesn't have to. This means that the code you'll find at the end of this post may not work on every website.
  • OP stated that to them "time doesn't matter", but it does to me 😈 This means that in order to save time my solution utilizes asyncio and aiohttp. If you wish to follow along with what the code actually does, then basic knowledge of how async works in Python is mandatory.
  • One minor consequence of the previous point is that the resulting code may generate a noticeable number of requests to the website you are crawling in a short period of time. While fairly unlikely, it is possible for the server to mistake this for a DDoS attack and take some countermeasures (see the throttling sketch right after this list).
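If that is a concern, the simplest mitigation is to cap how many requests can be in flight at once. Below is a minimal sketch of that idea using asyncio.Semaphore; the helper name politeGet and the limit of 5 are my own illustrative choices and are not part of the solution further down:

import asyncio
import aiohttp

async def politeGet(session, limiter, url):
    #hypothetical helper: the semaphore caps how many requests run at the same time
    async with limiter:
        async with session.get(url) as resp:
            return await resp.text()

async def demo():
    limiter = asyncio.Semaphore(5)  #5 concurrent requests is an arbitrary example value
    async with aiohttp.ClientSession() as session:
        robots = await politeGet(session, limiter, 'https://sevastopol.su/robots.txt')
        print(robots[:100])

#asyncio.run(demo())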

Background

I mentioned earlier that the website needs to follow certain conventions in order for my kind of crawling to work. There are 2 of them: robots.txt and sitemaps. If you are already familiar with both of these - skip to Code.

"robots.txt" is a file that's usually kept in the website's root directory (if we take the OP's example, that would be https://sevastopol.su/robots.txt). This file is meant for various search engine crawlers and helps them with navigation and figuring out what to work with and what to leave alone. Keeping a file like that is not, strictly speaking, mandatory but it's usually to the website's benefit to have one, so on most websites you will find this exact file in that exact place. Most of what robots.txt contains is not important for our task, except for this line: Sitemap: https://sevastopol.su/sitemap.xml

"Sitemap" is a protocol that uses XML syntax to list pages belonging to a given website along with some other optional data. Like with "robots.txt, its use is not mandatory but it tends to introduce more convenience, so most places use it. The main limitations for this file are:

  • No more than 50000 pages listed in a single file
  • No more than 50 MB total size

What if a website's too beefy to squeeze into one such file? Make more maps! Still too big? Sitemap index files are the next step. They are "maps of the maps", the syntax is largely the same but they point you to other maps rather than to the actual pages. Same limitations apply.
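For reference, here is roughly what the two flavours look like (hand-written illustrative fragments, not real files from sevastopol.su); the '<sitemapindex' marker is exactly what the code below uses to tell them apart:

#illustrative, hand-written XML fragments showing an actual map vs. a map index
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page-1</loc></url>
  <url><loc>https://example.com/page-2</loc></url>
</urlset>"""

SITEMAP_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-news.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-archive.xml.gz</loc></sitemap>
</sitemapindex>"""

for doc in (SITEMAP, SITEMAP_INDEX):
    print('map index' if '<sitemapindex' in doc else 'actual map')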

Details for both of these conventions:
https://developers.google.com/search/docs/crawling-indexing/robots/intro
https://www.sitemaps.org/protocol.html

This is all we need to know to figure out how to crawl.

Approach

Assuming the website in question has at least one sitemap that is referenced by the robots.txt, we need to do the following:

  1. fetch robots.txt and grab the locations of all the maps ("Sitemap:" line may be repeated multiple times)
  2. fetch each map and figure out if it's an actual map or a map index
  3. grab all pages from an actual map; or grab all map locations from a map index and go to step 2 again (decompress maps if gzip compression is applied)

And this is exactly what my solution does using 2 functions: "collectPagesFromMaps" for step 1 and "processMapsRecursively" for steps 2 and 3 (yes, it recurses, don't judge me).

Code

import asyncio, aiohttp, re, gzip

async def processMapsRecursively(queue, session, mainDomain, foundPageAddresses):
    #take map address from queue and check if it responds
    map = await queue.get()
    if not map.startswith('http'):
        map = mainDomain + map
    async with session.get(map) as resp:
        if resp.status != 200:
            print(f'"{map}" returned code "{resp.status}", ignoring')
            queue.task_done()
            return None
        
        #decompress map if necessary
        if map.endswith('.gz') or map.endswith('.tgz'):
            gzFile = gzip.decompress(await resp.read())
            content = gzFile.decode()
        else:
            content = await resp.text()
    
    #if map is a map index - recurse
    if '<sitemapindex' in content:
        #.*? before </sitemap> so that compact <sitemap><loc>...</loc></sitemap> entries are not skipped
        otherMaps = re.findall(r'<sitemap.+?<loc>\s*(.+?)\s*</loc>.*?</sitemap>', content, re.DOTALL)
        mapQueue = asyncio.Queue()
        for newMap in otherMaps:
            mapQueue.put_nowait(newMap)
        tasks = []
        for _ in range(mapQueue.qsize()):
            task = asyncio.create_task(processMapsRecursively(mapQueue, session, mainDomain, foundPageAddresses))
            tasks.append(task)
        await mapQueue.join()
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        queue.task_done()
    
    #if map is an actual map - parse
    else:
        #.*? before </url> so that compact <url><loc>...</loc></url> entries are not skipped
        newPages = re.findall(r'<url.+?<loc>\s*(.+?)\s*</loc>.*?</url>', content, re.DOTALL)
        foundPageAddresses.extend(newPages)
        queue.task_done()

async def collectPagesFromMaps(sitePage):
    #locate and parse robots.txt
    mainDomain = re.match(r'https?\://[a-zA-Z0-9\-\.]+', sitePage).group(0)
    session = aiohttp.ClientSession()
    async with session.get(f'{mainDomain}/robots.txt') as resp:
        robots = await resp.text()
    mainDomain = re.match(r'https?\://[a-zA-Z0-9\-\.]+', str(resp.url)).group(0)
    siteMaps = re.findall(r'(?i)Sitemap:\s*(\S+)', robots)
    if len(siteMaps) == 0:
        async with session.get(f'{mainDomain}/robots.txt') as resp:
            robots = await resp.text()
        siteMaps = re.findall(r'(?i)Sitemap:\s*(\S+)', robots)
    if len(siteMaps) == 0:
        print('No maps found!')
        await session.close()
        return None
    
    #put map addresses in queue
    siteMaps = [mapFile if mapFile.startswith('http') else mainDomain + mapFile for mapFile in siteMaps]
    mapQueue = asyncio.Queue()
    for map in siteMaps:
        mapQueue.put_nowait(map)
    foundPageAddresses = []
    tasks = []
    for _ in range(mapQueue.qsize()):
        task = asyncio.create_task(processMapsRecursively(mapQueue, session, mainDomain, foundPageAddresses))
        tasks.append(task)
    
    #process all maps
    await mapQueue.join()
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    await session.close()
    
    return foundPageAddresses

Example

>>> pages = asyncio.run(collectPagesFromMaps('https://sevastopol.su/all-news'))
>>> print(len(pages))
197734

Upvotes: 0

Raphael

Reputation: 1801

Your spider, rewritten so that it yields requests to crawl the subsequent pages:

from scrapy.spiders import CrawlSpider
from scrapy import Request
from urllib.parse import urljoin

class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['https://sevastopol.su/all-news']

    def parse(self, response):
        # This method is called for every successfully crawled page

        # get all pagination links using xpath
        for link in response.xpath("//li[contains(@class, 'pager-item')]/a/@href").getall():
            # build the absolute url 
            url = urljoin('https://sevastopol.su/', link)
            print(url)
            yield Request(url=url, callback=self.parse)  # <-- this makes your spider recursively crawl subsequent pages

Note that you don't have to worry about requesting the same URL multiple times: duplicates are dropped by Scrapy (with its default settings).
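If you don't want to generate a full Scrapy project, one way to run the spider above (a sketch, assuming the class is defined in the same file or already imported) is with CrawlerProcess:

from scrapy.crawler import CrawlerProcess

# run the spider defined above in the same process; LOG_LEVEL only keeps the output readable
process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(MySpider)
process.start()  # blocks until the crawl is finished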


Upvotes: 0

SIM

Reputation: 22440

This is something you might want to do to get all the links to the different questions asked. However, I think your script might hit a 404 error somewhere during execution, as there are millions of links to parse.

Run the script just the way it is:

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ["https://stackoverflow.com/questions/"]

    def parse(self, response):
        # collect every question link on the current listing page
        for link in response.css('.summary .question-hyperlink::attr(href)').getall():
            post_link = response.urljoin(link)
            yield {"link": post_link}

        # follow the "next" pagination link until there are no more pages
        next_page = response.css("a[rel='next']::attr(href)").get()
        if next_page:
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(next_page_url, callback=self.parse)
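To actually save the yielded links to a file and pace the crawl a little (given how many question pages there are), you could run it with Scrapy's runspider command; the script name, output file and one-second delay below are just example values:

scrapy runspider stackoverflow_spider.py -o question_links.jl -s DOWNLOAD_DELAY=1

The -o flag writes each yielded item to a JSON-lines file, and DOWNLOAD_DELAY spaces the requests out.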

Upvotes: 3

wohe1

Reputation: 775

You should write a regular expression (or a similar search function) that looks for <a> tags with a specific class (in the case of SO: class="question-hyperlink") and takes the href attribute from those elements. This will fetch all the links from the current page.

Then you can also search for the page links (at the bottom). There you'll see that those links are /questions?sort=active&page=<pagenumber>, where you can replace <pagenumber> with the page you want to scrape (e.g. make a loop that starts at 1 and keeps going until you get a 404 error).
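A minimal sketch of that loop, using requests and BeautifulSoup instead of a raw regular expression (both libraries would need to be installed, and the selectors reflect the old Stack Overflow markup, so they may need adjusting):

import requests
from bs4 import BeautifulSoup

page = 1
while True:
    # stop at the first page that doesn't return 200 (the 404 mentioned above);
    # depending on the site you may want an extra stop condition as well
    resp = requests.get(f"https://stackoverflow.com/questions?sort=active&page={page}")
    if resp.status_code != 200:
        break
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.find_all("a", class_="question-hyperlink"):
        print("https://stackoverflow.com" + a["href"])
    page += 1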

Upvotes: 0
