reifier

Reputation: 155

url harvester string manipulation

I'm doing a recursive URL harvest. When I find a link in the source that doesn't start with "http", I append it to the current URL. The problem is that on a dynamic site, a link without "http" is usually a new parameter for the current URL. For example, the current URL is something like http://www.somewebapp.com/default.aspx?pageid=4088, and the source for that page contains a link which is default.aspx?pageid=2111. In this case I need to do some string manipulation; this is where I need help.
Pseudocode:

if the link found contains a substring of the current url
      save the substring            
      save the unique part of the link found
replace whatever is after the substring in the current url with the unique saved part

What would this look like in Java? Any ideas for doing this differently? Thanks.

As per comment, here's what I've tried:

if (!matched.startsWith("http")) {
    String[] splitted = url.toString().split("/");
    String endOfURL = splitted[splitted.length-1];
    boolean b = false;
    while (!b && endOfURL.length() > 5) { // f.bar shortest val
        endOfURL = endOfURL.substring(0, endOfURL.length()-2);
        if (matched.contains(endOfURL)) {
            matched = matched.substring(endOfURL.length()-1);
            matched = url.toString().substring(url.toString().length() - matched.length()) + matched;
            b = true;
        }
    }
}

It's not working well.

Upvotes: 0

Views: 211

Answers (1)

Stephen C

Reputation: 718698

I think you are doing this the wrong way. Java has two classes, URL and URI, which parse URL/URI strings much more accurately than a "string bashing" solution. For example, the URL constructor URL(URL, String) creates a new URL object in the context of an existing one, so you don't need to worry about whether the String is an absolute URL or a relative one. You would use it something like this:

URL currentPageUrl = ...
String linkUrlString = ...

// (Exception handling not included ...)
URL linkUrl = new URL(currentPageUrl, linkUrlString);
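
To make that concrete, here is a minimal sketch that applies the same constructor to the URLs from the question. The class name LinkResolver and the helper resolve are hypothetical names chosen for illustration, and wrapping MalformedURLException in an unchecked exception is just one way to handle it. Note that recent JDK releases mark the URL constructors as deprecated in favor of going through URI, so the sketch suppresses that warning.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class LinkResolver {

    // Hypothetical helper: resolves a link found in a page against the URL
    // of that page. Absolute links come back unchanged; relative links are
    // resolved in the context of the current page.
    @SuppressWarnings("deprecation") // URL constructors are deprecated on newer JDKs
    static String resolve(String currentPageUrl, String link) {
        try {
            URL context = new URL(currentPageUrl);
            URL resolved = new URL(context, link);
            return resolved.toString();
        } catch (MalformedURLException e) {
            // Assumption: callers prefer an unchecked exception for bad input.
            throw new IllegalArgumentException("Cannot resolve link: " + link, e);
        }
    }

    public static void main(String[] args) {
        String current = "http://www.somewebapp.com/default.aspx?pageid=4088";

        // Relative link that only changes the query parameter.
        System.out.println(resolve(current, "default.aspx?pageid=2111"));
        // prints http://www.somewebapp.com/default.aspx?pageid=2111

        // Absolute link: the context is ignored.
        System.out.println(resolve(current, "http://example.org/index.html"));
        // prints http://example.org/index.html
    }
}
```

With this approach the harvester does not need the startsWith("http") check at all: every link found in the page can be passed through the same call.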

Upvotes: 1
