The web crawler Apache Nutch comes with built-in support for NTLM. I'm trying to use version 1.7 to crawl a web site (Windows SharePoint) using NTLM authentication. I have set up Nutch according to https://wiki.apache.org/nutch/HttpAuthenticationSchemes, which means in particular that I have the following credentials
<credentials username="rickert" password="mypassword">
<authscope host="server-to-be-crawled.com" port="80" realm="CORP" scheme="NTLM"/>
</credentials>
configured. When I look at the log files I can see that Nutch tries to access the seed URL and goes through the "normal" NTLM cycle: it receives a 401 error on the first GET, extracts the NTLM challenge, and sends the NTLM authentication in the next GET (using a keep-alive connection). However, the second GET is not successful either.
At that point I suspected some fundamental problem with my credentials or my specific setup: I'm running Nutch in a Debian guest in VirtualBox on a Windows host. But to my surprise both wget and curl were able to retrieve the document from within the Debian guest using my credentials. The interesting thing is that both command-line tools ONLY require a username and a password to work. The full-fledged NTLM specification, on the other hand, also requires a host and a domain. According to the specs the host is the one that the request originates from, which I would interpret as the one that the HTTP agent is running on; the domain is the Windows domain that the username is associated with. My assumption is that both tools simply leave these details empty.
This is where the configuration of Nutch comes in: the host is allegedly supplied as http.agent.host in the configuration file. The domain is supposed to be configured as the realm of the credentials, but the documentation says that this is a convention and not really necessary. In any case, it does not matter whether I set a realm or not; the result is the same. Looking at the log file again, I can see messages saying that the authentication is resolved using <any_realm>@server-to-be-crawled.com no matter which realm I use.
My gut feeling is that the Nutch configuration values are mapped incorrectly onto the NTLM parameters required by the Java httpclient class that executes the GET. I'm stuck. Can anybody give me some hints on how to debug this further? Does anybody have a concrete config that works for a SharePoint server? Thanks!
This is an old thread but it seems to be a common problem and I finally found a solution.
In my case the issue was that the content source I was trying to crawl was hosted on a fairly up-to-date IIS server. Inspection of the headers indicated that it was using NTLMv1, but after reading that Apache Commons HttpClient v3.x supports only NTLMv1 and not NTLMv2, I went looking for a way to add that support to Nutch v1.15 without upgrading to the newer HttpComponents version of HttpClient.
The clue is in the documentation for the newer HC version of HttpClient. Using this approach with JCIFS, I managed to modify the Nutch protocol-httpclient Http class so that it uses my new JCIFS-based NTLM scheme for authentication. Steps to do this:
Job done, I was able to then crawl NTLMv2 protected websites.
By also adding lots of extra logging I could then see the authentication handshake details which showed it was in fact NTLMv2 being used.
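If you want to check the handshake details from your own logs, a small standalone helper can decode the base64 tokens. The following is only a sketch, not part of the Nutch change: the class name NtlmTokenInspector and its describe method are made up for illustration. It assumes the standard NTLMSSP layout (8-byte "NTLMSSP\0" signature, then a little-endian 32-bit message type, and in a Type 3 message the NT response length as a 16-bit value at offset 20). An NT response of exactly 24 bytes is the NTLMv1 size (NTLM2 session responses are also 24 bytes); anything longer points to NTLMv2.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

/** Hypothetical debugging helper: describes a base64 NTLM token copied from a log. */
public class NtlmTokenInspector {

    // Every NTLMSSP message starts with this 8-byte signature.
    private static final byte[] SIGNATURE = "NTLMSSP\0".getBytes(StandardCharsets.US_ASCII);

    /** Takes the base64 part of the header value, i.e. without the leading "NTLM ". */
    public static String describe(String base64Token) {
        byte[] raw = Base64.getDecoder().decode(base64Token.trim());
        if (raw.length < 12 || !Arrays.equals(Arrays.copyOf(raw, 8), SIGNATURE)) {
            return "not an NTLMSSP message";
        }
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        int type = buf.getInt(8); // message type: 1, 2 or 3
        if (type != 3) {
            return "Type " + type + " message";
        }
        if (raw.length < 22) {
            return "Type 3 message (truncated)";
        }
        // Type 3: the NT response security buffer starts at offset 20; its first
        // two bytes hold the response length.
        int ntLen = buf.getShort(20) & 0xFFFF;
        String size = ntLen > 24 ? "NTLMv2-sized" : "NTLMv1-sized";
        return "Type 3 message, NT response " + ntLen + " bytes (" + size + ")";
    }

    public static void main(String[] args) {
        for (String token : args) {
            System.out.println(describe(token));
        }
    }
}
```

Pass the tokens from the Authorization and WWW-Authenticate log lines as command-line arguments; the third token of the handshake is the one that shows which response size the client actually sent.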
The change in Http.configureClient looks like this:
/** Configures the HTTP client */
private void configureClient() {
    // Register the JCIFS-based scheme under the NTLM name, replacing the built-in one.
    LOG.info("Setting new NTLM scheme: " + JcifsNtlmScheme.class.getName());
    AuthPolicy.registerAuthScheme(AuthPolicy.NTLM, JcifsNtlmScheme.class);
    ...
}
The new NTLM scheme implementation looks like this (needs a bit of tidying up).
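import java.io.IOException;

import org.apache.commons.httpclient.Credentials;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.NTCredentials;
import org.apache.commons.httpclient.auth.AuthChallengeParser;
import org.apache.commons.httpclient.auth.AuthScheme;
import org.apache.commons.httpclient.auth.AuthenticationException;
import org.apache.commons.httpclient.auth.InvalidCredentialsException;
import org.apache.commons.httpclient.auth.MalformedChallengeException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class JcifsNtlmScheme implements AuthScheme {

    public static final Logger LOG = LoggerFactory.getLogger(JcifsNtlmScheme.class);

    /** NTLM challenge string. */
    private String ntlmchallenge = null;

    private static final int UNINITIATED = 0;
    private static final int INITIATED = 1;
    private static final int TYPE1_MSG_GENERATED = 2;
    private static final int TYPE2_MSG_RECEIVED = 3;
    private static final int TYPE3_MSG_GENERATED = 4;
    private static final int FAILED = Integer.MAX_VALUE;

    /** Authentication process state (starts as UNINITIATED, the int default of 0). */
    private int state;

    public JcifsNtlmScheme() throws AuthenticationException {
        // Check if JCIFS is present. If not present, do not proceed.
        try {
            Class.forName("jcifs.ntlmssp.NtlmMessage", false, this.getClass().getClassLoader());
            LOG.trace("jcifs.ntlmssp.NtlmMessage is present");
        } catch (ClassNotFoundException e) {
            throw new AuthenticationException("Unable to proceed as JCIFS library is not found.", e);
        }
    }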

    public String authenticate(Credentials credentials, HttpMethod method) throws AuthenticationException {
        LOG.trace("authenticate called. State: " + this.state);
        if (this.state == UNINITIATED) {
            throw new IllegalStateException("NTLM authentication process has not been initiated");
        }
        NTCredentials ntcredentials = null;
        try {
            ntcredentials = (NTCredentials) credentials;
        } catch (ClassCastException e) {
            throw new InvalidCredentialsException(
                    "Credentials cannot be used for NTLM authentication: " + credentials.getClass().getName());
        }
        NTLM ntlm = new NTLM();
        String charset = method.getParams().getCredentialCharset();
        LOG.trace("Setting credential charset to: " + charset);
        ntlm.setCredentialCharset(charset);
        String response = null;
        if (this.state == INITIATED || this.state == FAILED) {
            // No server challenge yet: start the handshake with a Type 1 message.
            LOG.trace("Generating TYPE1 message");
            response = ntlm.generateType1Msg(ntcredentials.getHost(), ntcredentials.getDomain());
            this.state = TYPE1_MSG_GENERATED;
        } else {
            // A Type 2 challenge has been received: answer it with a Type 3 message.
            LOG.trace("Generating TYPE3 message");
            response = ntlm.generateType3Msg(ntcredentials.getUserName(), ntcredentials.getPassword(),
                    ntcredentials.getHost(), ntcredentials.getDomain(), this.ntlmchallenge);
            this.state = TYPE3_MSG_GENERATED;
        }
        String result = "NTLM " + response;
        return result;
    }

    public String authenticate(Credentials credentials, String method, String uri) throws AuthenticationException {
        throw new RuntimeException("Not implemented as it is deprecated anyway in Httpclient 3.x");
    }

    public String getID() {
        throw new RuntimeException("Not implemented as it is deprecated anyway in Httpclient 3.x");
    }

    /**
     * Returns the authentication parameter with the given name, if available.
     *
     * There are no valid parameters for NTLM authentication so this method always
     * returns null.
     *
     * @param name The name of the parameter to be returned
     *
     * @return the parameter with the given name
     */
    public String getParameter(String name) {
        if (name == null) {
            throw new IllegalArgumentException("Parameter name may not be null");
        }
        return null;
    }

    /**
     * The concept of an authentication realm is not supported by the NTLM
     * authentication scheme. Always returns null.
     *
     * @return null
     */
    public String getRealm() {
        return null;
    }

    /**
     * Returns textual designation of the NTLM authentication scheme.
     *
     * @return ntlm
     */
    public String getSchemeName() {
        return "ntlm";
    }

    /**
     * Tests if the NTLM authentication process has been completed.
     *
     * @return true if the Type 3 message has been generated or the process has
     *         failed, false otherwise.
     *
     * @since 3.0
     */
    public boolean isComplete() {
        boolean result = this.state == TYPE3_MSG_GENERATED || this.state == FAILED;
        LOG.trace("isComplete? " + result);
        return result;
    }

    /**
     * Returns true. NTLM authentication scheme is connection based.
     *
     * @return true.
     *
     * @since 3.0
     */
    public boolean isConnectionBased() {
        return true;
    }

    /**
     * Processes the NTLM challenge.
     *
     * @param challenge the challenge string
     *
     * @throws MalformedChallengeException is thrown if the authentication challenge
     *                                     is malformed
     *
     * @since 3.0
     */
    public void processChallenge(final String challenge) throws MalformedChallengeException {
        String s = AuthChallengeParser.extractScheme(challenge);
        LOG.trace("processChallenge called. challenge: " + challenge + " scheme: " + s);
        if (!s.equalsIgnoreCase(getSchemeName())) {
            LOG.trace("Invalid scheme name in challenge. Should be: " + getSchemeName());
            throw new MalformedChallengeException("Invalid NTLM challenge: " + challenge);
        }
        int i = challenge.indexOf(' ');
        if (i != -1) {
            // "NTLM <base64>": the server sent its Type 2 message.
            LOG.trace("processChallenge: TYPE2 message received");
            s = challenge.substring(i, challenge.length());
            this.ntlmchallenge = s.trim();
            this.state = TYPE2_MSG_RECEIVED;
        } else {
            // Bare "NTLM": either the first 401, or the server rejected our attempt.
            this.ntlmchallenge = "";
            if (this.state == UNINITIATED) {
                this.state = INITIATED;
                LOG.trace("State was UNINITIATED, switching to INITIATED");
            } else {
                LOG.trace("State is FAILED");
                this.state = FAILED;
            }
        }
    }

    private class NTLM {

        /** Character encoding */
        public static final String DEFAULT_CHARSET = "ASCII";

        /**
         * The character set was used by 3.x's NTLM to encode the username and
         * password. Apparently, this is not needed when passing the username and
         * password from NTCredentials to the JCIFS library.
         */
        private String credentialCharset = DEFAULT_CHARSET;

        void setCredentialCharset(String credentialCharset) {
            this.credentialCharset = credentialCharset;
        }

        private String generateType1Msg(String host, String domain) {
            jcifs.ntlmssp.Type1Message t1m = new jcifs.ntlmssp.Type1Message(
                    jcifs.ntlmssp.Type1Message.getDefaultFlags(), domain, host);
            String result = jcifs.util.Base64.encode(t1m.toByteArray());
            LOG.trace("generateType1Msg: " + result);
            return result;
        }

        private String generateType3Msg(String username, String password, String host, String domain,
                String challenge) {
            jcifs.ntlmssp.Type2Message t2m;
            try {
                t2m = new jcifs.ntlmssp.Type2Message(jcifs.util.Base64.decode(challenge));
            } catch (IOException e) {
                throw new RuntimeException("Invalid Type2 message", e);
            }
            jcifs.ntlmssp.Type3Message t3m = new jcifs.ntlmssp.Type3Message(t2m, password, domain, username, host, 0);
            String result = jcifs.util.Base64.encode(t3m.toByteArray());
            LOG.trace("generateType3Msg username: [" + username + "] host: [" + host + "] domain: [" + domain
                    + "] response: [" + result + "]");
            return result;
        }
    }
}
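To make the state handling in processChallenge easier to follow, here is a stripped-down sketch of just the transition logic, with no HttpClient or JCIFS dependency. The class NtlmChallengeSketch and its next method are hypothetical names used only for this illustration; the transitions are meant to mirror the scheme above. It shows why a second bare "NTLM" header from the server (the symptom described in the question) ends in the FAILED state.

```java
/** Hypothetical, dependency-free illustration of the challenge state transitions. */
public class NtlmChallengeSketch {

    public enum State { UNINITIATED, INITIATED, TYPE2_MSG_RECEIVED, FAILED }

    /**
     * Returns the state that follows `current` after the server sends the given
     * WWW-Authenticate header value.
     */
    public static State next(State current, String challenge) {
        String value = challenge.trim();
        int space = value.indexOf(' ');
        String scheme = space == -1 ? value : value.substring(0, space);
        if (!scheme.equalsIgnoreCase("NTLM")) {
            throw new IllegalArgumentException("Invalid NTLM challenge: " + challenge);
        }
        if (space != -1) {
            // "NTLM <base64>": the server answered the Type 1 message with a Type 2.
            return State.TYPE2_MSG_RECEIVED;
        }
        // Bare "NTLM": the first 401 starts the handshake; any later one means
        // the server did not accept what the client sent.
        return current == State.UNINITIATED ? State.INITIATED : State.FAILED;
    }

    public static void main(String[] args) {
        State s = next(State.UNINITIATED, "NTLM");
        System.out.println(s);
        s = next(s, "NTLM TlRMTVNTUAACAAAA");
        System.out.println(s);
        System.out.println(next(s, "NTLM"));
    }
}
```

When the trace log of the real scheme shows "State is FAILED" right after the Type 3 message was sent, the server rejected the credentials or the response format rather than the handshake sequence itself.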