Reputation: 499
I am trying to make a tool that gets every link from a website. For example, I need to get all the question pages from Stack Overflow. I tried using Scrapy:
from scrapy.spiders import CrawlSpider
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['https://stackoverflow.com/questions/']

    def parse(self, response):
        le = LinkExtractor()
        for link in le.extract_links(response):
            url_lnk = link.url
            print(url_lnk)
Here I only get the questions from the start page. What do I need to do to get all the 'question' links? Time doesn't matter, I just need to understand what to do.
UPD
The site I want to crawl is https://sevastopol.su/ - a local city news website.
The list of all news should be contained here: https://sevastopol.su/all-news
At the bottom of that page you can see page numbers, but if we go to the last page of news we will see that it is numbered 765 (right now, 19.06.2019), yet it shows the last news item dated 19 June 2018. So the pagination only exposes the last year of news. But there are plenty of older news links (probably going back to 2010) that are still alive and can even be found through the site's search page. That is why I wanted to know whether there is access to some global link store of this site.
Upvotes: 3
Views: 9106
Reputation: 1
While the original question was narrowed down in the details, the wording of the question itself is fairly broad. Because of this I expect many people to stumble upon this question while solving broader problems and leave unsatisfied, so I want to try to provide an answer that covers more cases.
The important thing to note is that the website needs to follow certain conventions in order for my kind of crawling to work. There are 2 of them: robots.txt and sitemaps. If you are already familiar with both of these, skip ahead to the code.
"robots.txt" is a file that's usually kept in the website's root directory (if we take the OP's example, that would be https://sevastopol.su/robots.txt). This file is meant for various search engine crawlers and helps them with navigation and figuring out what to work with and what to leave alone. Keeping a file like that is not, strictly speaking, mandatory but it's usually to the website's benefit to have one, so on most websites you will find this exact file in that exact place. Most of what robots.txt contains is not important for our task, except for this line: Sitemap: https://sevastopol.su/sitemap.xml
"Sitemap" is a protocol that uses XML syntax to list pages belonging to a given website along with some other optional data. Like with "robots.txt, its use is not mandatory but it tends to introduce more convenience, so most places use it. The main limitations for this file are:
What if a website's too beefy to squeeze into one such file? Make more maps! Still too big? Sitemap index files are the next step. They are "maps of the maps": the syntax is largely the same, but they point you to other maps rather than to the actual pages. The same limitations apply.
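If regular expressions are not your thing, the two document types can also be told apart and parsed with the standard library's XML module. A small sketch, where the two sample documents are made up and trimmed to the tags that matter for crawling:

import xml.etree.ElementTree as ET

# made-up examples: a sitemap index (points to other maps) and a plain sitemap (points to pages)
SITEMAP_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>"""

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page-1</loc></url>
  <url><loc>https://example.com/page-2</loc></url>
</urlset>"""

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def classify_and_extract(xml_text):
    # return ('index', child map URLs) for a sitemap index, ('urlset', page URLs) for a plain sitemap
    root = ET.fromstring(xml_text)
    if root.tag.endswith('sitemapindex'):
        return 'index', [loc.text.strip() for loc in root.findall('sm:sitemap/sm:loc', NS)]
    return 'urlset', [loc.text.strip() for loc in root.findall('sm:url/sm:loc', NS)]

print(classify_and_extract(SITEMAP_INDEX))  # ('index', ['https://example.com/sitemap-1.xml', ...])
print(classify_and_extract(SITEMAP))        # ('urlset', ['https://example.com/page-1', ...])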
Details for both of these conventions:
https://developers.google.com/search/docs/crawling-indexing/robots/intro
https://www.sitemaps.org/protocol.html
This is all we need to know to figure out how to crawl.
Assuming the website in question has at least one sitemap that is referenced by its robots.txt, we need to do the following:
1. Fetch robots.txt and collect the sitemap addresses it lists.
2. Fetch each sitemap; if it turns out to be a sitemap index, queue up the maps it points to and repeat this step for them.
3. Parse each actual sitemap and collect the page addresses it lists.
And this is exactly what my solution does using 2 functions: "collectPagesFromMaps" for step 1 and "processMapsRecursively" for steps 2 and 3 (yes, it recurses, don't judge me).
import asyncio, aiohttp, re, gzip


async def processMapsRecursively(queue, session, mainDomain, foundPageAddresses):
    # take map address from queue and check if it responds
    map = await queue.get()
    if not map.startswith('http'):
        map = mainDomain + map
    async with session.get(map) as resp:
        if resp.status != 200:
            print(f'"{map}" returned code "{resp.status}", ignoring')
            queue.task_done()
            return None
        # decompress map if necessary
        if map.endswith('.gz') or map.endswith('.tgz'):
            gzFile = gzip.decompress(await resp.read())
            content = gzFile.decode()
        else:
            content = await resp.text()
    # if map is a map index - recurse
    if '<sitemapindex' in content:
        otherMaps = re.findall(r'<sitemap.+?<loc>\s*(.+?)\s*</loc>.+?</sitemap>', content, re.DOTALL)
        mapQueue = asyncio.Queue()
        for newMap in otherMaps:
            mapQueue.put_nowait(newMap)
        tasks = []
        for _ in range(mapQueue.qsize()):
            task = asyncio.create_task(processMapsRecursively(mapQueue, session, mainDomain, foundPageAddresses))
            tasks.append(task)
        await mapQueue.join()
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
        queue.task_done()
    # if map is an actual map - parse
    else:
        newPages = re.findall(r'<url.+?<loc>\s*(.+?)\s*</loc>.+?</url>', content, re.DOTALL)
        foundPageAddresses.extend(newPages)
        queue.task_done()


async def collectPagesFromMaps(sitePage):
    # locate and parse robots.txt
    mainDomain = re.match(r'https?\://[a-zA-Z0-9\-\.]+', sitePage).group(0)
    session = aiohttp.ClientSession()
    async with session.get(f'{mainDomain}/robots.txt') as resp:
        robots = await resp.text()
        mainDomain = re.match(r'https?\://[a-zA-Z0-9\-\.]+', str(resp.url)).group(0)
    siteMaps = re.findall(r'Sitemap\:\s*(.+?\.xml)', robots)
    if len(siteMaps) == 0:
        async with session.get(f'{mainDomain}/robots.txt') as resp:
            robots = await resp.text()
        siteMaps = re.findall(r'Sitemap\:\s*(.+?\.xml)', robots)
        if len(siteMaps) == 0:
            print('No maps found!')
            await session.close()
            return None
    # put map addresses in queue
    siteMaps = [mapFile if mapFile.startswith('http') else mainDomain + mapFile for mapFile in siteMaps]
    mapQueue = asyncio.Queue()
    for map in siteMaps:
        mapQueue.put_nowait(map)
    foundPageAddresses = []
    tasks = []
    for _ in range(mapQueue.qsize()):
        task = asyncio.create_task(processMapsRecursively(mapQueue, session, mainDomain, foundPageAddresses))
        tasks.append(task)
    # process all maps
    await mapQueue.join()
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    await session.close()
    return foundPageAddresses
>>> pages = asyncio.run(collectPagesFromMaps('https://sevastopol.su/all-news'))
>>> print(len(pages))
197734
Upvotes: 0
Reputation: 1801
Your spider, which now yields requests to crawl the subsequent pages:
from scrapy.spiders import CrawlSpider
from scrapy import Request
from urllib.parse import urljoin


class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['https://sevastopol.su/all-news']

    def parse(self, response):
        # This method is called for every successfully crawled page.
        # Get all pagination links using XPath.
        for link in response.xpath("//li[contains(@class, 'pager-item')]/a/@href").getall():
            # build the absolute url
            url = urljoin('https://sevastopol.su/', link)
            print(url)
            yield Request(url=url, callback=self.parse)  # <-- this makes your spider crawl the subsequent pages recursively
Note that you don't have to worry about requesting the same URL multiple times: duplicates are dropped by Scrapy (with the default settings).
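As an aside (not part of the code above): that filtering is controlled by Scrapy's settings, so if you ever need to inspect or change it, the relevant knobs look roughly like this (illustrative values):

# settings.py (sketch)
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'  # this is already the default
DUPEFILTER_DEBUG = True  # log every dropped duplicate request instead of only the first one
# to deliberately re-fetch a URL, pass dont_filter=True to that individual Request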
Next steps:
Configure Scrapy (e.g. user agent, crawl delay, ...): https://docs.scrapy.org/en/latest/topics/settings.html
Handle errors (errback): https://docs.scrapy.org/en/latest/topics/request-response.html
Use Item Pipelines to store your URLs etc. (see the sketch after this list): https://docs.scrapy.org/en/latest/topics/item-pipeline.html
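A minimal sketch of that last point, assuming the spider yields items like {"link": url} rather than just printing them; the class name, file names and pipeline priority below are made up, and the pipeline still has to be enabled via ITEM_PIPELINES:

# pipelines.py (sketch)
class LinkWriterPipeline:
    def open_spider(self, spider):
        # one output file per crawl; 'links.txt' is an arbitrary choice
        self.file = open('links.txt', 'a', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # append each collected URL on its own line
        self.file.write(item['link'] + '\n')
        return item

# settings.py (sketch)
# ITEM_PIPELINES = {'myproject.pipelines.LinkWriterPipeline': 300}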
Upvotes: 0
Reputation: 22440
This is something you might want to do to get all the links to the different questions asked. However, I think your script might hit a 404 error somewhere during execution, as there are millions of links to parse.
Run the script just the way it is:
import scrapy


class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ["https://stackoverflow.com/questions/"]

    def parse(self, response):
        for link in response.css('.summary .question-hyperlink::attr(href)').getall():
            post_link = response.urljoin(link)
            yield {"link": post_link}

        next_page = response.css("a[rel='next']::attr(href)").get()
        if next_page:
            next_page_url = response.urljoin(next_page)
            yield scrapy.Request(next_page_url, callback=self.parse)
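If you would rather launch it from a plain Python script than via the scrapy CLI, something along these lines should work with a reasonably recent Scrapy (the links.json file name is just an example):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={"FEEDS": {"links.json": {"format": "json"}}})
process.crawl(StackOverflowSpider)
process.start()  # blocks until the crawl finishes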
Upvotes: 3
Reputation: 775
You should write a regular expression (or a similar search function) that looks for <a> tags with a specific class (in the case of SO: class="question-hyperlink") and takes the href attribute from those elements. This will fetch all the links from the current page.
Then you can also search for the page links (at the bottom). Here you see that those links are /questions?sort=active&page=<pagenumber>, where you can change <pagenumber> to the page you want to scrape (e.g. make a loop that starts at 1 and goes on until you get a 404 error).
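A rough sketch of that approach with requests plus a regular expression; the question-hyperlink class and the URL pattern are the ones named above, while the stopping condition and the delay are just illustrative choices:

import re
import time
import requests

page = 1
while True:
    resp = requests.get(f'https://stackoverflow.com/questions?sort=active&page={page}')
    if resp.status_code == 404:  # stop once the pagination runs out
        break
    resp.raise_for_status()
    # match every <a ...> tag carrying class="question-hyperlink", regardless of attribute order
    for a_tag in re.findall(r'<a\b[^>]*class="question-hyperlink"[^>]*>', resp.text):
        href = re.search(r'href="([^"]+)"', a_tag)
        if href:
            print('https://stackoverflow.com' + href.group(1))
    page += 1
    time.sleep(1)  # be polite between page requests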
Upvotes: 0