flyingarmadillo

Reputation: 2139

How can I tell if a page allows bots?

I am trying to create a bot that checks to see if a particular URL has some particular content. However, I keep getting an 'HTTP redirection loop' error when I run it.

My only suspicion is that the page does not allow bots. Is there any way to tell whether a page disallows bots? I have googled it, but I have yet to find an answer.

EDIT

After checking some things out, this is what the robots.txt says:

User-agent: *
Disallow: /advsched/

I also noticed that when I disable cookies in my browser and visit the page, I get the same 'HTTP redirection loop' error. So from what I understand, the page I am trying to access does not allow bots. However, from what I understand about cURL functions, as long as my user-agent is something like this:

Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5

The site cannot tell whether I am a bot. That leaves only one thing: cookies. I know cURL functions can process cookies, but can they handle them so that I look like a standard user? I have not been able to get it to work yet.
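For reference, here is a rough sketch of the kind of cURL cookie handling I mean (the URL and cookie-file path are just placeholders):

<?php
// Rough sketch: fetch a page with cURL while persisting cookies and
// sending a browser-style User-Agent. URL and file path are placeholders.
$url        = 'http://example.com/advsched/';
$cookieFile = '/tmp/bot_cookies.txt';              // cURL reads and writes cookies here

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);    // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);    // follow redirects
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);           // fail instead of looping forever
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // save cookies when the handle closes
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // send saved cookies with each request
curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');

$body = curl_exec($ch);
if ($body === false) {
    echo 'cURL error: ' . curl_error($ch) . "\n";
}
curl_close($ch);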

Upvotes: 0

Views: 1175

Answers (2)

ghoti

Reputation: 46876

Check /robots.txt and interpret its contents.

Instructions are at http://robotstxt.org/
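A rough sketch of the idea (example URL and path; a real crawler should implement more of the spec than this):

<?php
// Sketch: fetch robots.txt and check whether a path falls under a
// Disallow rule for all user agents (User-agent: *). Example values only.
function pathDisallowed($siteRoot, $path) {
    $robots = @file_get_contents(rtrim($siteRoot, '/') . '/robots.txt');
    if ($robots === false) {
        return false;                                    // no robots.txt: nothing is disallowed
    }
    $appliesToAll = false;
    foreach (preg_split('/\r?\n/', $robots) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line));  // strip comments
        if (stripos($line, 'User-agent:') === 0) {
            $appliesToAll = (trim(substr($line, strlen('User-agent:'))) === '*');
        } elseif ($appliesToAll && stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, strlen('Disallow:')));
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return true;                             // path matches a Disallow rule
            }
        }
    }
    return false;
}

// With the robots.txt shown in the question, this prints "disallowed".
echo pathDisallowed('http://example.com', '/advsched/') ? "disallowed\n" : "allowed\n";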

Upvotes: 0

Brad

Reputation: 163548

You can't tell.

What's a bot? How does the server know? Generally, the identifying information is in the User-Agent header sent by the client with the request. However, there is no requirement that a server block "bots" in general. Suppose they only want to block Google?

The suggestion of checking robots.txt is a good one. Site owners will typically put rules in there describing what bots may access and what may be done with scraped information. This won't have anything to do with your redirections, though.
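If you want to see where the loop happens, something along these lines (placeholder URL) caps the redirects and prints what cURL actually did. If adding a cookie jar makes the redirect count drop to zero, cookies were the culprit, not bot detection:

<?php
// Sketch: cap redirects and report what cURL did, to help diagnose the loop.
$ch = curl_init('http://example.com/advsched/');   // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 5);            // give up instead of looping forever
curl_exec($ch);

echo 'HTTP code: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . "\n";
echo 'Redirects: ' . curl_getinfo($ch, CURLINFO_REDIRECT_COUNT) . "\n";
echo 'Final URL: ' . curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) . "\n";
curl_close($ch);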

Upvotes: 3
