Reputation: 73
I'm planning on crawling a website using C++, and I have gathered information on how to crawl a website from scratch. I download the web pages using the WinHTTP library; I want to build a crawler of my own rather than use third-party libraries. The information I have gathered so far:
1. Check robots.txt to find which pages can be crawled and what the request time gap (crawl delay) should be (a parsing sketch follows this list).
2. Check whether the site has a sitemap.xml and gather the URLs listed in it.
3. Check all the href attributes (and sitemap URL entries) and find the folders/paths in them (see the extraction sketch below).
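For step 1, here is a minimal sketch of parsing an already-downloaded robots.txt body for the `User-agent: *` group, collecting `Disallow` prefixes and the `Crawl-delay` value. The struct and function names, and the simplified one-group logic, are my own assumptions for illustration; a real parser would also handle `Allow` lines and per-agent groups.

```cpp
// Sketch: parse robots.txt rules for "User-agent: *" (assumed simplification).
#include <algorithm>
#include <cctype>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct RobotsRules {
    std::vector<std::string> disallowed; // path prefixes that must not be crawled
    double crawl_delay = 0.0;            // seconds to wait between requests
};

// Trim leading/trailing whitespace.
static std::string trim(const std::string& s) {
    auto b = s.find_first_not_of(" \t\r\n");
    auto e = s.find_last_not_of(" \t\r\n");
    return b == std::string::npos ? "" : s.substr(b, e - b + 1);
}

// Lower-case copy, so field names match case-insensitively.
static std::string lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return s;
}

RobotsRules parseRobots(const std::string& body) {
    RobotsRules rules;
    bool inStarGroup = false; // inside a "User-agent: *" group?
    std::istringstream in(body);
    std::string line;
    while (std::getline(in, line)) {
        line = line.substr(0, line.find('#')); // strip comments
        auto colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::string field = lower(trim(line.substr(0, colon)));
        std::string value = trim(line.substr(colon + 1));
        if (field == "user-agent") {
            inStarGroup = (value == "*");
        } else if (inStarGroup && field == "disallow" && !value.empty()) {
            rules.disallowed.push_back(value);
        } else if (inStarGroup && field == "crawl-delay") {
            rules.crawl_delay = std::stod(value);
        }
    }
    return rules;
}

// A URL path is crawlable if no Disallow prefix matches it.
bool allowed(const RobotsRules& r, const std::string& path) {
    for (const auto& p : r.disallowed)
        if (path.compare(0, p.size(), p) == 0) return false;
    return true;
}

int main() {
    std::string body = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n";
    RobotsRules r = parseRobots(body);
    std::cout << std::boolalpha
              << allowed(r, "/private/a.html")   // false
              << " " << allowed(r, "/index.html") // true
              << " delay=" << r.crawl_delay << "\n";
}
```

For steps 2 and 3, both reduce to pulling URLs out of downloaded text. This rough sketch uses std::regex to grab href="..." attributes from HTML and <loc>...</loc> entries from a sitemap. Regular expressions are a simplification here, not a full HTML/XML parser, but they illustrate the extraction step:

```cpp
// Sketch: extract candidate URLs from HTML hrefs and sitemap <loc> entries.
#include <iostream>
#include <regex>
#include <string>
#include <vector>

std::vector<std::string> extractUrls(const std::string& doc) {
    std::vector<std::string> urls;
    // Match href="..." / href='...' in HTML, or <loc>...</loc> in sitemaps.
    static const std::regex re(
        R"(href\s*=\s*["']([^"']+)["']|<loc>\s*([^<]+?)\s*</loc>)",
        std::regex::icase);
    for (auto it = std::sregex_iterator(doc.begin(), doc.end(), re);
         it != std::sregex_iterator(); ++it) {
        const std::smatch& m = *it;
        urls.push_back(m[1].matched ? m[1].str() : m[2].str());
    }
    return urls;
}

int main() {
    std::string doc = R"(<a href="/about.html">About</a>
                         <loc>https://example.com/page1</loc>)";
    for (const auto& u : extractUrls(doc))
        std::cout << u << "\n"; // /about.html, https://example.com/page1
}
```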
Is there anything else I should do in order to crawl a website fully?
Upvotes: 3
Views: 1093
Reputation: 4245
You should add database support; I would recommend SQLite3. You should have a mechanism for storing the current state of the crawler, so that in case of premature termination it can continue from where it stopped last time (a sketch of such a mechanism follows).

Using the WinHTTP library can carry several limitations:

- HTTPS support will be a bit limited, for example encryption support up to 128-bit (see SSL in WinHTTP).
- Edge cases such as an invalid or expired SSL certificate, which a browser user can override interactively, and HTTP sites addressed with an HTTPS prefix (or vice versa), have to be handled by your own code.

I would use libcurl and OpenSSL instead of WinHTTP (see the fetch sketch below).
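Here is a minimal sketch of persisting crawler state in SQLite3 so a crawl can resume after a crash. The "frontier" table schema and the helper function names are assumptions for illustration; build with something like `g++ crawler_db.cpp -lsqlite3`.

```cpp
// Sketch: durable crawl frontier in SQLite3 (table name/schema assumed).
#include <iostream>
#include <string>
#include <sqlite3.h>

// Create the state table on a fresh run; existing rows survive restarts.
void initDb(sqlite3* db) {
    const char* sql =
        "CREATE TABLE IF NOT EXISTS frontier ("
        "  url  TEXT PRIMARY KEY,"
        "  done INTEGER NOT NULL DEFAULT 0);";
    char* err = nullptr;
    if (sqlite3_exec(db, sql, nullptr, nullptr, &err) != SQLITE_OK) {
        std::cerr << "init failed: " << err << "\n";
        sqlite3_free(err);
    }
}

// Queue a URL; INSERT OR IGNORE keeps already-seen URLs from being re-queued.
void enqueue(sqlite3* db, const std::string& url) {
    sqlite3_stmt* st = nullptr;
    sqlite3_prepare_v2(db, "INSERT OR IGNORE INTO frontier(url) VALUES (?);",
                       -1, &st, nullptr);
    sqlite3_bind_text(st, 1, url.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_step(st);
    sqlite3_finalize(st);
}

// Fetch the next unfinished URL, or an empty string when the frontier is empty.
std::string nextUrl(sqlite3* db) {
    sqlite3_stmt* st = nullptr;
    sqlite3_prepare_v2(db, "SELECT url FROM frontier WHERE done = 0 LIMIT 1;",
                       -1, &st, nullptr);
    std::string url;
    if (sqlite3_step(st) == SQLITE_ROW)
        url = reinterpret_cast<const char*>(sqlite3_column_text(st, 0));
    sqlite3_finalize(st);
    return url;
}

// Mark a URL as crawled so it is skipped after a restart.
void markDone(sqlite3* db, const std::string& url) {
    sqlite3_stmt* st = nullptr;
    sqlite3_prepare_v2(db, "UPDATE frontier SET done = 1 WHERE url = ?;",
                       -1, &st, nullptr);
    sqlite3_bind_text(st, 1, url.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_step(st);
    sqlite3_finalize(st);
}

int main() {
    sqlite3* db = nullptr;
    sqlite3_open("crawler.db", &db); // state file survives process restarts
    initDb(db);
    enqueue(db, "https://example.com/");
    std::string url = nextUrl(db);   // resumes from whatever was left undone
    if (!url.empty()) {
        // ... download and parse url here ...
        markDone(db, url);
    }
    sqlite3_close(db);
}
```

And a short sketch of fetching a page over HTTPS with libcurl, which delegates TLS to OpenSSL (or another backend). The user-agent string and URL are placeholders; build with something like `g++ fetch.cpp -lcurl`.

```cpp
// Sketch: download one page over HTTPS with libcurl.
#include <iostream>
#include <string>
#include <curl/curl.h>

// libcurl hands the response body to this callback in chunks; append to a string.
static size_t writeBody(char* data, size_t size, size_t nmemb, void* userp) {
    static_cast<std::string*>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    std::string body;
    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writeBody);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L); // follow redirects
    curl_easy_setopt(curl, CURLOPT_USERAGENT, "MyCrawler/0.1");
    // Certificate verification stays ON by default; a crawler should log and
    // skip invalid-certificate sites rather than disable verification.
    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        std::cerr << "fetch failed: " << curl_easy_strerror(rc) << "\n";
    else
        std::cout << "downloaded " << body.size() << " bytes\n";
    curl_easy_cleanup(curl);
    curl_global_cleanup();
}
```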
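A note on the design choice: keeping the frontier in SQLite rather than in memory means every enqueue and markDone is immediately durable, so a crash loses at most the page currently being processed, at the cost of a disk write per URL. Wrapping batches of inserts in a transaction would recover most of that cost.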
Upvotes: 1