Malin Pedersen

Reputation: 171

Is it possible to use cookies with HTTrack?

I want to download a site from a URL, which is easy enough. But the first page requires me to log in, as I normally do in a browser, and HTTrack only gets as far as that first page since it can't use my cookies or log in.

Is there any way for me to get around this?

Upvotes: 13

Views: 20113

Answers (3)

Frank Einstein

Reputation: 711

This question was asked in 2013, so I'm not sure whether HTTrack supported cookies back then, but it definitely does now.

Instructions:

  1. Log in to your website using Firefox or Chrome, then look up the login cookie.
  2. In the root of the folder where you are downloading the website, open the file named cookies.txt. If it's not there, create it.
  3. Copy the login cookie from your browser into this file.
    (You can also copy several cookies if you don't know exactly which one is used for login. Some websites have a lot of cookies with confusing names that all look like a login hash.)

More info:

  • If you don't know how to look at your cookies, it's relatively simple...
    You have to open the Dev Tools (F12) and navigate to the Cookies section:
    For Firefox: F12 -> Storage -> Cookies
    For Chrome: F12 -> Application -> Storage -> Cookies

  • If you are still having problems with HTTrack even after doing everything correctly, try copying your browser's User-Agent into your HTTrack configuration. By default HTTrack sends its own User-Agent, and some websites don't like it and reject those connections.
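
The User-Agent tip above could look roughly like this on the command line. This is only a sketch: mirror_with_ua is a hypothetical helper name, and the URL, output folder and User-Agent string in the example call are placeholders. It relies on HTTrack's -O option (output folder) and -F option (User-Agent string).

```shell
#!/bin/sh
# Hypothetical helper: mirror a site while sending a browser-like User-Agent.
# -O sets the output folder, -F sets the User-Agent header HTTrack sends.
mirror_with_ua() {
    url=$1
    out_dir=$2
    user_agent=$3
    httrack "$url" -O "$out_dir" -F "$user_agent"
}

# Example call (placeholder URL, folder and User-Agent string):
# mirror_with_ua "https://www.example.com/" ./mirror "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"
```

Copy the exact User-Agent string from your own browser rather than the placeholder, so the requests match the session the cookie came from.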

Example of a cookies.txt for HTTrack:

www.httrack.com TRUE    /       FALSE   1999999999  foo bar
www.example.com TRUE    /folder FALSE   1999999999  JSESSIONID  xxx1234
www.example.com TRUE    /hello  FALSE   1999999999  JSESSIONID  yyy1234

IMPORTANT: Don't copy/paste this example of cookies.txt. StackOverflow automatically converts TABS into SPACES, and cookies.txt doesn't work when the fields are separated by spaces. There is nothing I can do to fix this example, so only use it as a visual reference. Thanks to tugelblend for pointing this out in the comments.
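
One way to sidestep the tab problem is to let the shell write the file, since printf turns \t into a real TAB. The sketch below assumes the seven-column layout from the example above (domain, flag, path, secure, expiry, name, value). write_cookie is a hypothetical helper name, and the host, cookie name and value are placeholders for your own.

```shell
#!/bin/sh
# Write one cookies.txt line with real TAB separators.
# Arguments: domain, path, expiry (Unix time), cookie name, cookie value.
# The second and fourth columns are fixed to TRUE and FALSE, as in the example.
write_cookie() {
    printf '%s\tTRUE\t%s\tFALSE\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Placeholder values: replace them with the cookie copied from your browser.
write_cookie www.example.com / 1999999999 JSESSIONID xxx1234 > cookies.txt
```

Use >> instead of > to append further cookies to the same file.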

Reference: http://httrack.kauler.com/help/Cookies

Upvotes: 22

tuomassalo

Reputation: 9121

Adding to Frank Einstein's answer:

You might not need cookies.txt, as httrack also has a --headers option. First copy the relevant session cookie from the browser, and then you can use:

httrack --headers 'Cookie: SESSIONID=1234...' ...

Upvotes: 1

Kohjah Breese

Reputation: 4136

Try using cURL in PHP:

http://php.net/manual/en/book.curl.php

There are wrappers for this, like:

http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading

EDIT: A more specific example (not tested). Download the class from the link above, then use options such as:

require_once( 'CURL.php' ); // Change this to whatever that class is called in the above
$curl = new CURL();
$curl->retry = 2;
$opts = array(
    CURLOPT_USERAGENT       => 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.3) Gecko/20091020 Linux Mint/8 (Helena) Firefox/3.5.3',
    // Read cookies from and write them back to the same file, so the session persists
    CURLOPT_COOKIEFILE      => 'fb.tmp',
    CURLOPT_COOKIEJAR       => 'fb.tmp',
    CURLOPT_FOLLOWLOCATION  => 1,
    CURLOPT_RETURNTRANSFER  => 1,
    // These two turn off SSL certificate checks; only do this for testing
    CURLOPT_SSL_VERIFYHOST  => 0,
    CURLOPT_SSL_VERIFYPEER  => 0,
    CURLOPT_TIMEOUT         => 20
);
$post_data = array(  ); // Put your login POST data here
$opts[CURLOPT_POSTFIELDS] = http_build_query( $post_data );
$curl->addSession( 'https://www.facebook.com/messages', $opts );
$result = $curl->exec();
$curl->clear();
print_r( $result );

Note that sometimes you need to load a page first, to set a cookie, before the site will let you log in.

Upvotes: -3
