Shabari nath k

Reputation: 890

robots.txt content / selenium web scraping

I am trying to scrape a website using Selenium.

What does this robots.txt content mean?

User-Agent: *
Disallow: /go/
Disallow: /launch-announcement/

Can I run web scraping in all folders except /go/ and /launch-announcement/?

Upvotes: 5

Views: 6431

Answers (2)

NarendraR

Reputation: 7708

What is a robots.txt file?

Robots.txt is a text file webmasters create to instruct web robots (typically search engine robots) how to crawl pages on their website. The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content up to users. The REP also includes directives like meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links (such as “follow” or “nofollow”).

In practice, robots.txt files indicate whether certain user agents (web-crawling software) can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents.

A Disallow: line tells the robot that it should not visit any URL on the site whose path begins with the listed value.

Can I run web scraping in all folders except go and launch-announcement?

Yes. As far as this robots.txt is concerned, you can scrape any page outside those two directories.
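
If you want to check this programmatically before handing a URL to Selenium, here is a minimal sketch using Python's standard `urllib.robotparser`. It parses the rules quoted in the question directly, so it needs no network access. The `example.com` URLs and the `candidates` list are made up for illustration; substitute the real site's addresses.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the question, parsed from a string (no HTTP request made).
ROBOTS_TXT = """\
User-Agent: *
Disallow: /go/
Disallow: /launch-announcement/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs; example.com stands in for the real site.
candidates = [
    "https://example.com/",
    "https://example.com/products/list",
    "https://example.com/go/partner",
    "https://example.com/launch-announcement/",
]

# Keep only the URLs that the rules permit for any user agent ("*").
allowed = [url for url in candidates if parser.can_fetch("*", url)]
print(allowed)
```

Only the URLs left in `allowed` would then be passed on to the Selenium driver. Against a live site you would normally call `set_url()` and `read()` on the parser instead of `parse()`, so that the current robots.txt is fetched.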

Upvotes: 5

Chase

Reputation: 5615

According to the basic robots.txt guide, the rule

User-Agent: *
Disallow: /go/
Disallow: /launch-announcement/

means crawling /go/ and /launch-announcement/ (and their subdirectories) is disallowed for all user agents.
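
To make the "and their subdirectories" part concrete, here is a rough sketch of the prefix matching involved. It is a simplification that assumes plain path prefixes only (no wildcards, no Allow lines, no percent-encoding handling), and `is_disallowed` and the `example.com` URLs are hypothetical names used for illustration.

```python
from urllib.parse import urlsplit

# The two Disallow values from the rule above, treated as plain path prefixes.
DISALLOWED_PREFIXES = ("/go/", "/launch-announcement/")


def is_disallowed(url: str) -> bool:
    """Return True if the URL's path starts with one of the disallowed prefixes."""
    path = urlsplit(url).path or "/"
    return path.startswith(DISALLOWED_PREFIXES)


# Anything beneath a disallowed directory is covered.
print(is_disallowed("https://example.com/go/a/b/page.html"))
print(is_disallowed("https://example.com/launch-announcement/2019/"))

# The trailing slash matters: these paths only share leading characters.
print(is_disallowed("https://example.com/gopher"))
print(is_disallowed("https://example.com/go"))
```

The first two calls report the URL as disallowed and the last two do not, because `/gopher` and `/go` do not begin with the full prefix `/go/`.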

Upvotes: 3
