arvil

Reputation: 920

Using wget or curl to test a website's .htaccess + robots.txt

I am trying to debug my website's .htaccess and robots.txt. I want to use cURL or wget to try to access files that I blocked via robots.txt, and pages that should redirect to another location via .htaccess.

I have the following in my robots.txt

User-agent: *
Disallow: /wp/wp-admin/
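
As a sanity check, fetching the file directly confirms what the server actually serves (a quick sketch against the same placeholder host):

$ curl http://xxxx.com/robots.txt

This should print the two rules above; if it returns a 404 or HTML, crawlers will never see the rules.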

Yet I am still able to crawl it:

wget

$ wget http://xxxx.com/wp/wp-admin/
SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files (x86)\GnuWin32/etc/wgetrc
--2017-08-28 07:37:05--  http://xxxx.com/wp/wp-admin/
Resolving xxxx.com... 118.127.47.249
Connecting to xxxx.com|118.127.47.249|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://xxxx.com/wp/wp-login.php?redirect_to=http%3A%2F%2Fxxxx.com%2Fwp%2Fwp-admin%2F&reauth=1 [following]
--2017-08-28 07:37:12--  http://xxxx.com/wp/wp-login.php?redirect_to=http%3A%2F%2Fxxxx.com%2Fwp%2Fwp-admin%2F&reauth=1
Connecting to xxxx.com|118.127.47.249|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2891 (2.8K) [text/html]
Saving to: `wp-login.php@redirect_to=http%3A%2F%2Fxxxx.com%2Fwp%2Fwp-admin%2F&reauth=1'

100%[==============================================================================>] 2,891       --.-K/s   in 0.1s

2017-08-28 07:37:17 (22.2 KB/s) - `wp-login.php@redirect_to=http%3A%2F%2Fxxxx.com%2Fwp%2Fwp-admin%2F&reauth=1' saved [2891/2891]

curl

$ curl -L xxx.com/wp/wp-admin -o wp-admin.html
% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100  1147  100  1147    0     0    107      0  0:00:10  0:00:10 --:--:--   280
0     0    0     0    0     0      0      0 --:--:--  0:01:37 --:--:--     0
100  2891  100  2891    0     0     17      0  0:02:50  0:02:42  0:00:08   234

Neither wget nor curl respected robots.txt. Is there a way to check how my .htaccess and robots.txt behave? Thanks!

Upvotes: 2

Views: 5181

Answers (1)

jrtapsell

Reputation: 7031

robots.txt is purely advisory: it only affects crawlers that choose to honour it, so a direct fetch with curl or wget ignores it entirely (wget consults robots.txt only during recursive downloads). If you want to check that your robots.txt is parseable, you can use Google's robots.txt tester in Search Console, which shows any errors and issues in the file.
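
To see the exclusion actually take effect, you can let wget crawl recursively; a rough sketch against the question's placeholder host:

# Recursive crawl: wget fetches robots.txt first and will not follow
# links into the disallowed /wp/wp-admin/ directory.
$ wget -r -l2 http://xxxx.com/wp/

# To deliberately ignore robots.txt during a recursive crawl:
$ wget -r -l2 -e robots=off http://xxxx.com/wp/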

Redirects set up in .htaccess are applied server-side, so they work with any client, and the wget output above already shows the 302 redirect being followed.
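
To verify the redirects without downloading anything, ask curl for just the response headers; a sketch using the URL from the question:

# Show only the headers; you should see the 302 and Location from above:
$ curl -sI http://xxxx.com/wp/wp-admin/

# Or follow the redirect chain and print the final status code and URL:
$ curl -sIL -o /dev/null -w '%{http_code} %{url_effective}\n' http://xxxx.com/wp/wp-admin/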

Upvotes: 3
