Reputation: 24154
I am trying to crawl a page that requires SiteMinder authentication, so I am passing my username and password in the code itself to access that page and then crawl all the links it contains. This is my Controller.java code, from which the MyCrawler class is invoked.
import edu.uci.ics.crawler4j.crawler.CrawlController;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");
        controller.addSeed("http://ho.somehost.com/");
        // Configure the crawl before starting it; start() blocks until the
        // crawl finishes, so settings applied after it never take effect.
        controller.setPolitenessDelay(200);
        controller.setMaximumCrawlDepth(3);
        controller.start(MyCrawler.class, 10);
    }
}
And this is my MyCrawler.java code, in which I pass my credentials (username and password) for the SiteMinder authentication. I just want to make sure: should the authentication be done in this MyCrawler code, or in the Controller code above? The crawler code is taken from here (http://code.google.com/p/crawler4j/).
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.List;
import java.util.regex.Pattern;

import org.apache.http.HttpResponse;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.conn.ConnectTimeoutException;
import org.apache.http.impl.client.DefaultHttpClient;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    private final Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    public MyCrawler() {
    }

    @Override
    public boolean shouldVisit(WebURL url) {
        System.out.println("RJ:- " + url);
        DefaultHttpClient client = null;
        try {
            // Register the username/password for any host/port so HttpClient
            // answers authentication challenges with these credentials.
            client = new DefaultHttpClient();
            client.getCredentialsProvider().setCredentials(
                    new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT, null),
                    new UsernamePasswordCredentials("test", "test"));
            // Set timeout
            //client.getParams().setParameter(CoreConnectionPNames.SO_TIMEOUT, 5000);
            HttpGet request = new HttpGet(url.toString());
            HttpResponse response = client.execute(request);
            if (response.getStatusLine().getStatusCode() == 200) {
                // Dump the response body to the console.
                InputStream responseIS = response.getEntity().getContent();
                BufferedReader reader = new BufferedReader(new InputStreamReader(responseIS));
                String line = reader.readLine();
                while (line != null) {
                    System.out.println(line);
                    line = reader.readLine();
                }
                reader.close();
            } else {
                System.out.println("Resource not available");
            }
        } catch (ClientProtocolException e) {
            System.out.println(e.getMessage());
        } catch (ConnectTimeoutException e) {
            System.out.println(e.getMessage());
        } catch (IOException e) {
            System.out.println(e.getMessage());
        } catch (Exception e) {
            System.out.println(e.getMessage());
        } finally {
            if (client != null) {
                client.getConnectionManager().shutdown();
            }
        }
        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        if (href.startsWith("http://")) {
            return true;
        }
        return false;
    }

    @Override
    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();

        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Text length: " + text.length());
        System.out.println("Number of links: " + links.size());
        System.out.println("Docid of parent page: " + parentDocid);
        System.out.println("=============");
    }
}
I am printing the URL so that I can see which URLs are visited. It prints two URLs: the actual URL that requires authentication, and then some SiteMinder URL. And when I run this project I get the following output:
RJ:- http://ho.somehost.com/net/pa/ho.xhtml
WARN [Crawler 1] Invalid cookie header: "Set-Cookie: SMCHALLENGE=; expires=Sat, 15 Jan 2011 02:52:54 GMT; path=/; domain=.somehost.com". Unable to parse expires attribute: Sat, 15 Jan 2011 02:52:54 GMT
WARN [Crawler 1] Invalid cookie header: "Set-Cookie: SMIDENTITY=nzFSq2U3g/C3C6/jkj/Ocghyh/njK; expires=Sat, 13 Jul 2013 02:52:54 GMT; path=/; domain=.somehost.com". Unable to parse expires attribute: Sat, 13 Jul 2013 02:52:54 GMT
null
INFO [Crawler 1] Number of pages fetched per second: 0
RJ:- https://lo.somehost.com/site/no/176/sm.exhtml
WARN [Crawler 1] Invalid cookie header: "Set-Cookie: SMCHALLENGE=; expires=Sat, 15 Jan 2011 02:52:56 GMT; path=/; domain=.somehost.com". Unable to parse expires attribute: Sat, 15 Jan 2011 02:52:56 GMT
WARN [Crawler 1] Invalid cookie header: "Set-Cookie: SMIDENTITY=IqsIPo; expires=Sat, 13 Jul 2013 02:52:56 GMT; path=/; domain=.somehost.com". Unable to parse expires attribute: Sat, 13 Jul 2013 02:52:56 GMT
Any suggestions would be appreciated. If I copy-paste that login URL into the browser, it asks for a username and password, and after I type them in I get the actual screen.
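For what it's worth, the warnings can be reproduced outside of crawler4j with a minimal HttpClient 4.x fetch along these lines (a sketch only; the host is the placeholder used above, and the class name is illustrative):
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

// Illustrative repro sketch: fetch the seed URL once with a bare
// DefaultHttpClient and watch the console for the same
// "Invalid cookie header" warnings.
public class CookieWarningRepro {
    public static void main(String[] args) throws Exception {
        DefaultHttpClient client = new DefaultHttpClient();
        try {
            HttpResponse response = client.execute(new HttpGet("http://ho.somehost.com/"));
            System.out.println(response.getStatusLine());
            EntityUtils.consume(response.getEntity()); // release the connection
        } finally {
            client.getConnectionManager().shutdown();
        }
    }
}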
Upvotes: 1
Views: 6026
Reputation: 76709
Extracting the salient contents of the chat discussion for posterity, in case anyone experiences the same issue.
The warning messages indicate that HttpClient was unable to parse the Set-Cookie headers issued by SiteMinder. Analysis of the network traffic using Wireshark pointed to the two cookies involved: SMCHALLENGE and SMIDENTITY. Therefore, the responses containing the Set-Cookie headers for these two cookies need to be examined. If the suspected root cause (the use of four-digit years in the cookie expires value) turns out to be incorrect, then one must specify the date format used to parse the cookie value. This can be done by supplying the list of allowed/accepted date formats to HttpClient in the following manner:
HttpGet request = new HttpGet(url.toString());
// Tell HttpClient which date patterns are acceptable in a cookie's expires
// attribute (requires org.apache.http.client.params.CookieSpecPNames and java.util.Arrays).
request.getParams().setParameter(CookieSpecPNames.DATE_PATTERNS,
        Arrays.asList("EEE, d MMM yyyy HH:mm:ss z"));
HttpResponse response = client.execute(request);
instead of the existing calls:
HttpGet request = new HttpGet(url.toString());
HttpResponse response = client.execute(request);
The pattern specified, EEE, d MMM yyyy HH:mm:ss z, is a valid pattern for the dates that appear to be parsed incorrectly (going by the messages in the console). You will need to add additional patterns if there are other date formats that are not handled correctly by HttpClient. For details on the format used, see the SimpleDateFormat class documentation.
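As a quick sanity check, the pattern can be verified against one of the expires values from the console output using plain SimpleDateFormat (a sketch only; the class name is illustrative and the date string is copied from the warnings above):
import java.text.SimpleDateFormat;
import java.util.Locale;

// Illustrative check: confirms that the proposed pattern parses the
// expires value that HttpClient was rejecting.
public class CookieDatePatternCheck {
    public static void main(String[] args) throws Exception {
        String expires = "Sat, 13 Jul 2013 02:52:56 GMT"; // copied from the warnings
        SimpleDateFormat fmt = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss z", Locale.US);
        System.out.println(fmt.parse(expires)); // throws ParseException if the pattern does not match
    }
}
If you would rather not set the parameter on every request, it should also work to set CookieSpecPNames.DATE_PATTERNS once on client.getParams(), since in HttpClient 4.x request parameters fall back to the client's parameters.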
Upvotes: 1