url harvester string manipulation

Question

I'm doing a recursive URL harvest. When I find a link in the source that doesn't start with "http", I append it to the current URL. The problem is that on a dynamic site, a link without "http" is usually a new parameter for the current URL. For example, the current URL might be http://www.somewebapp.com/default.aspx?pageid=4088, and the source for that page contains the link default.aspx?pageid=2111. In this case I need to do some string manipulation; this is where I need help.
Pseudocode:

if the link found contains a substring of the current url
      save the substring
      save the unique part of the link found
replace whatever is after the substring in the current url with the unique saved part

What would this look like in Java? Any ideas for doing this differently? Thanks.
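
One possible reading of the pseudocode above, as a minimal sketch only. It assumes the "substring" is the prefix that the link shares with the last path segment of the current URL, and that everything after the last "/" of the path gets replaced. The class and method names (LinkTailReplacer, commonPrefixLength, replaceTail) are hypothetical and made up for illustration.

```java
class LinkTailReplacer {

    // Length of the longest prefix shared by a and b.
    static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Replaces the last segment of currentUrl (including its query string) with link.
    static String replaceTail(String currentUrl, String link) {
        // Ignore the query string when looking for the last path separator,
        // so a '/' inside a parameter value is not mistaken for one.
        int queryStart = currentUrl.indexOf('?');
        String pathPart = queryStart >= 0 ? currentUrl.substring(0, queryStart) : currentUrl;
        int schemeEnd = pathPart.indexOf("://");
        int lastSlash = pathPart.lastIndexOf('/');

        String base;
        String tail;
        if (schemeEnd >= 0 && lastSlash <= schemeEnd + 2) {
            // No path at all (e.g. "http://host"): the only slashes belong to "://".
            base = pathPart + "/";
            tail = "";
        } else {
            base = currentUrl.substring(0, lastSlash + 1);
            tail = currentUrl.substring(lastSlash + 1);
        }

        // "save the substring" and "save the unique part of the link found"
        int shared = commonPrefixLength(tail, link);
        String sharedPart = link.substring(0, shared);
        String uniquePart = link.substring(shared);

        // "replace whatever is after the substring in the current url"
        return base + sharedPart + uniquePart;
    }

    public static void main(String[] args) {
        System.out.println(replaceTail(
                "http://www.somewebapp.com/default.aspx?pageid=4088",
                "default.aspx?pageid=2111"));
    }
}
```

Note that sharedPart + uniquePart is always the whole link, so under these assumptions the pseudocode reduces to "keep the current URL up to its last path separator, then append the link". The sketch does not handle links that start with "/" or "../".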

As per comment, here's what I've tried:

if (!matched.startsWith("http")) {
    // Take the last path segment of the current URL, e.g. "default.aspx?pageid=4088".
    String current = url.toString();
    String[] splitted = current.split("/");
    String endOfURL = splitted[splitted.length - 1];
    boolean found = false;
    // Shorten the segment two characters at a time until it occurs in the link.
    while (!found && endOfURL.length() > 5) { // "f.bar" is the shortest value considered
        endOfURL = endOfURL.substring(0, endOfURL.length() - 2);
        if (matched.contains(endOfURL)) {
            matched = matched.substring(endOfURL.length() - 1);
            matched = current.substring(current.length() - matched.length()) + matched;
            found = true;
        }
    }
}

It's not working well.
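
A different approach, offered as a sketch rather than a definitive fix: instead of manual string manipulation, java.net.URI can resolve a relative link against the current URL. The class and method names here (RelativeLinkResolver, resolveLink) are hypothetical. The sketch assumes both strings are syntactically valid URIs, since URI.create throws IllegalArgumentException otherwise (for example, on unescaped spaces).

```java
import java.net.URI;

class RelativeLinkResolver {

    // Resolves a link found in a page against the URL of that page.
    // Links that already carry a scheme (absolute URLs) come back unchanged.
    static String resolveLink(String currentUrl, String link) {
        URI base = URI.create(currentUrl);
        return base.resolve(link).toString();
    }

    public static void main(String[] args) {
        String current = "http://www.somewebapp.com/default.aspx?pageid=4088";
        System.out.println(resolveLink(current, "default.aspx?pageid=2111"));
        System.out.println(resolveLink(current, "http://example.org/other.html"));
    }
}
```

With the example from the question, this yields http://www.somewebapp.com/default.aspx?pageid=2111. Edge cases such as a base URL with no path, or a link consisting only of a query string, are worth testing separately before relying on this in a harvester.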
