Are crawlers able to get all directories on a webpage?

Question

I want to back up some files in the root of my website, at a path like /www/mysite/myfolder/myfile.xls. Are crawlers able to find the directory, even if it is not used for files that the website itself needs? Thank you.

aufziehvogel · Accepted Answer

A web crawler that does not use brute-force or dictionary trials (explained later) can find a file only if at least one link to that file exists on a page the crawler has already visited.

From the path /www/mysite/myfolder/myfile.xls I assume there might be another issue as well. A web crawler can only find files that are publicly available, and not every file under /www, /var/www, /htdocs or whatever directory is in use is necessarily public. There might be a structure like /www/mysite/public, where only public is served to the web. With such a structure you can make sure that files elsewhere in /www/mysite cannot be downloaded unless something like a PHP script performs a permission check first.
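To make the document-root point concrete, here is a minimal sketch of the check. It assumes the web server only serves files below one directory; the function name and the /www/mysite/public root are hypothetical, and your real server configuration is what actually decides this.

```python
from pathlib import PurePosixPath


def is_under_document_root(file_path, document_root):
    """Return True if file_path lies inside document_root.

    Pure path comparison only: it does not resolve symlinks or ".."
    segments and does not read any server configuration.
    """
    path = PurePosixPath(file_path)
    root = PurePosixPath(document_root)
    return path == root or root in path.parents


# Assumed layout: only /www/mysite/public is served to the web.
DOCUMENT_ROOT = "/www/mysite/public"

print(is_under_document_root("/www/mysite/public/index.html", DOCUMENT_ROOT))
print(is_under_document_root("/www/mysite/myfolder/myfile.xls", DOCUMENT_ROOT))
```

If the second call reports that the backup file is outside the document root, no crawler can request it over HTTP, whether or not a link to it exists.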

So you have to check whether both of these hold:

1. your directory can be accessed via HTTP, FTP or some other protocol
2. there is a link to your file on another webpage the crawler can find (technically the crawler needs at least one start page, of course)
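The two conditions above can be illustrated with a toy link-following crawler. This is only a sketch: the site is a hypothetical in-memory dictionary instead of real HTTP requests, and `crawl`, `LinkExtractor` and `SITE` are names made up for the example. Real crawlers are far more elaborate, but the discovery principle is the same.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Breadth-first crawl. fetch(url) returns HTML, or None if unavailable."""
    seen = {start_url}
    queue = deque([start_url])
    found = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # not publicly available
        found.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return found


# Hypothetical site: the backup file is served, but nothing links to it.
SITE = {
    "http://example.com/": '<a href="/about.html">About</a>',
    "http://example.com/about.html": '<a href="/">Home</a>',
    "http://example.com/myfolder/myfile.xls": "(binary data)",
}

reachable = crawl("http://example.com/", SITE.get)
print(reachable)
```

The crawler only ever requests URLs it has seen in a link, so the unlinked myfile.xls never enters the queue even though the server would deliver it.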

Exception: brute-force trials

There is an exception in which files without any link can also be found: search engines could try to discover files by extending the already known URL space of a domain with known or random words. Of course, this can only be done sporadically. Consider a TinyURL-style generator: its URLs usually consist of a short known URL plus a few random characters. A search engine could try out these short character sequences, hoping to find files in the so-called deep web. For example, it is possible that nobody has ever written the link http://example.com/f8fwy down anywhere; nonetheless, it could point to a real resource (if you are lucky, a website or file that has never been linked to either).

However, now that search engine companies also run mail services (Google) or chat services (Microsoft, Skype), I think this technique has become less important, because they could harvest deep-web links from those services instead.
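For illustration, this is roughly how candidate URLs for such brute-force or dictionary trials could be generated. It is a sketch under my own assumptions (lowercase letters plus digits as the alphabet, a tiny word list, made-up function names), not a description of what any real search engine does. It only builds URL strings and sends no requests.

```python
import string
from itertools import islice, product

ALPHABET = string.ascii_lowercase + string.digits  # 36 characters (assumed)


def dictionary_candidates(base_url, words, extensions):
    """Yield base_url/word+ext for every word and extension combination."""
    for word, ext in product(words, extensions):
        yield f"{base_url}/{word}{ext}"


def short_code_candidates(base_url, length, alphabet=ALPHABET):
    """Yield every URL with a path of `length` characters from `alphabet`."""
    for chars in product(alphabet, repeat=length):
        yield f"{base_url}/{''.join(chars)}"


def search_space(length, alphabet_size=len(ALPHABET)):
    """Number of distinct codes of the given length."""
    return alphabet_size ** length


guesses = list(
    dictionary_candidates("http://example.com", ["backup", "myfolder"], ["", ".xls"])
)
first_codes = list(islice(short_code_candidates("http://example.com", 5), 3))

print(guesses)
print(first_codes)
print(search_space(5))  # 36**5 = 60466176 five-character codes
```

With about 60 million possibilities for just five characters on a single domain, it is clear why this can only be tried sporadically, and why a sufficiently unusual directory name is unlikely to be guessed.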
