Reputation: 790
I have been working on web scraping and encountered the following patterns in a robots.txt file:
Disallow: /*{{url}}*
Disallow: /*{{imageURL}}*
Do they mean that I am not allowed to scrape any URL?
Upvotes: 2
Views: 299
Reputation: 96577
This looks like the site author made an error, as {{url}} and {{imageURL}} were probably intended to be variables that should be replaced with the actual values.
When interpreting this record according to the original robots.txt specification, all characters have to be interpreted literally, so URLs like these would be disallowed:
https://example.com/*{{url}}*
https://example.com/*{{url}}*.bar
https://example.com/*{{url}}*/
https://example.com/*{{url}}*/foo
As { and } are not allowed to appear in a URL path (list of allowed characters), no valid URL can match these two rules, so in effect they do not block any URL from being crawled. If you prefer, you could assume that the rules apply to the percent-encoded forms of { and } (%7B and %7D), but that’s not something the spec requires.
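To illustrate the literal reading, here is a minimal sketch of prefix matching in Python. The function name is_disallowed_literal and the decode_percent flag are my own (hypothetical, not part of any library), and the percent-decoding branch reflects the optional, more cautious assumption mentioned above rather than a requirement of the original spec.

```python
from urllib.parse import unquote, urlsplit


def is_disallowed_literal(url: str, disallow_value: str,
                          decode_percent: bool = False) -> bool:
    """Treat the Disallow value as a plain path prefix (no wildcards).

    decode_percent is a hypothetical switch for the cautious reading:
    it percent-decodes the URL path before comparing.
    """
    path = urlsplit(url).path or "/"
    if decode_percent:
        path = unquote(path)
    return path.startswith(disallow_value)


rule = "/*{{url}}*"

# Matches only because the path starts with the rule, character for character.
print(is_disallowed_literal("https://example.com/*{{url}}*/foo", rule))  # True

# An ordinary URL does not start with the literal rule text.
print(is_disallowed_literal("https://example.com/products/42", rule))  # False

# The percent-encoded form matches only under the cautious reading.
encoded = "https://example.com/*%7B%7Burl%7D%7D*"
print(is_disallowed_literal(encoded, rule))  # False
print(is_disallowed_literal(encoded, rule, decode_percent=True))  # True
```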
When interpreting this record based on popular extensions of the robots.txt spec (e.g., as used by Google Search), the * has a special meaning: each * in a Disallow value matches any sequence of characters, including an empty one. This would lead to many more disallowed URLs, but they would still have to contain {{url}} or {{imageURL}} literally.
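The wildcard reading can be sketched in Python by translating each * into a regular-expression "match anything" and escaping everything else. This is only an approximation of how such extensions behave, not Google's actual implementation; the helper names are hypothetical, and other extension features (such as an end-of-URL anchor) are left out.

```python
import re


def wildcard_rule_to_regex(disallow_value: str) -> re.Pattern:
    """Turn a Disallow value into a regex: '*' matches any run of characters
    (including none); every other character is matched literally."""
    literal_parts = disallow_value.split("*")
    return re.compile(".*".join(re.escape(part) for part in literal_parts))


def is_disallowed_wildcard(path: str, disallow_value: str) -> bool:
    # re.match anchors at the start of the path, like a prefix rule.
    return wildcard_rule_to_regex(disallow_value).match(path) is not None


# The braces still have to appear literally somewhere in the path.
print(is_disallowed_wildcard("/foo/{{url}}/bar", "/*{{url}}*"))  # True
print(is_disallowed_wildcard("/{{url}}", "/*{{url}}*"))  # True
print(is_disallowed_wildcard("/images/{{imageURL}}.png", "/*{{imageURL}}*"))  # True

# An ordinary path contains neither placeholder, so neither rule applies.
print(is_disallowed_wildcard("/products/42", "/*{{url}}*"))  # False
print(is_disallowed_wildcard("/products/42", "/*{{imageURL}}*"))  # False
```

So even under the wildcard interpretation, normal pages on the site are not blocked by these two rules.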
Upvotes: 1