user5746191

How to get absolute url using java or jsoup

I have a text box and a submit button on my JSP page. When the form is submitted with a URL in the text box, I fetch the response from that URL using URLConnection:

    String strUrl = request.getParameter("url");
    URL url = new URL(strUrl);
    // Base64-encoded credentials (Apache Commons Codec); not used further in this excerpt
    byte[] encodedBytes = Base64.encodeBase64("root:pass".getBytes());
    String encoding = new String(encodedBytes);
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setRequestMethod("GET");
    connection.connect();

    InputStream content = connection.getInputStream();
    BufferedReader in = new BufferedReader(new InputStreamReader(content));
    try {
        BufferedWriter writer = new BufferedWriter(new FileWriter(new File("f:\\new.html")));
        String line;
        while ((line = in.readLine()) != null) {
            String s = line;
            writer.write(s);
            writer.newLine(); // readLine() strips the line terminator
        }
        writer.close();
    } catch (Exception e) {
        e.printStackTrace();
    }

In the resulting HTML page, all CSS, JS, and images are missing, because their paths now resolve against my local machine. For example, a script is referenced like this in the generated HTML page:

    <script src="/ajax/libs/jquery/2.1.1/jquery.min.js"></script>

But the actual src should be:

    <script src="https://www.url.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>

I know there are many solutions that prefix every src and href with the URL host, and I found many answers about that.

I used the following solution:

    if (s.contains("href=")) {
        if (s.contains("\"../") || s.contains("\"/")) {
            s = s.replace("\"../", "\"http://" + url.getHost() + "/");
            s = s.replace("\"/", "\"http://" + url.getHost() + "/");
            writer.write(s);
            out.println(s);
        }
    }

Now I get working links, but not on every website: this only helps for sites whose src and href values need nothing more than the host as a prefix.

On some websites, links are defined as href="frmArticles.aspx". In this case, prefixing the host is not enough, because the link is relative to the directory of the page rather than to the site root. For example, the following URL has href links whose path differs from what host-prefixing produces:

http://www.nakkheeran.in/Users/frmMagazine.aspx?M=2

    <a href="frmArticles.aspx?A=25744">தை தை தை</a>

If I add the host to this href, it becomes:

    <a href="http://www.nakkheeran.in/frmArticles.aspx?A=25744">தை தை தை</a>

This page does not exist, because the actual URL is:

    <a href="http://www.nakkheeran.in/Users/frmArticles.aspx?A=25744">தை தை தை</a>

Upvotes: 3

Views: 1592

Answers (3)

S. Doe

Reputation: 785

In Jsoup you can also use org.jsoup.nodes.Node.absUrl(String) as an alternative to the attr("abs:href") approach that @jonas-czech described.

Upvotes: 0

Jonas Czech

Reputation: 12328

There are essentially two ways to get the absolute URL:

  • Using Jsoup's abs:href attribute getter. It works like this:

    Element a = myDoc.select("a").first(); // selects the first link on the page; replace with whatever selector you need to get your link (a element)
    String url = a.attr("abs:href"); // gets the absolute URL of the link (href attribute)
    

    Note that Jsoup needs the URL of the HTML document you are using so that it can resolve the link correctly. This is done automatically if you use Jsoup.connect(myHtmlUrl).get(). If you are parsing HTML from a String or from a file, use the Jsoup.parse() overload that lets you provide a base URL.

  • The other way is with Java's built-in URL class, which is probably what you should use in your case. You can use it like this:

    String absoluteUrl = new URL(new URL("http://example.com/example.html"), "script.js").toString();
    

    Printing absoluteUrl would give:

    http://example.com/script.js
    

    To clarify a bit, the first parameter (in this case http://example.com/example.html) is the URL your HTML document came from, and the second parameter ("script.js") is the URL found in your HTML.

    In your case, you could use it like:

    String absoluteUrl = new URL(new URL("https://www.url.com/"), "/ajax/libs/jquery/2.1.1/jquery.min.js").toString();
    

    Printing absoluteUrl will give:

    https://www.url.com/ajax/libs/jquery/2.1.1/jquery.min.js
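The same two-argument constructor also seems to cover the frmArticles.aspx link from the question, as long as the first argument is the full URL of the page (including its /Users/ directory) rather than just the host. The following is a minimal sketch under that assumption; the class name AbsoluteUrlSketch and the helper resolve are hypothetical names introduced only for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

class AbsoluteUrlSketch {

    // Hypothetical helper: resolves a link found in a page against the URL of that page.
    static String resolve(String pageUrl, String link) {
        try {
            return new URL(new URL(pageUrl), link).toString();
        } catch (MalformedURLException e) {
            // Wrapped so that callers do not have to handle a checked exception.
            throw new IllegalArgumentException("Cannot resolve " + link + " against " + pageUrl, e);
        }
    }

    public static void main(String[] args) {
        // Link relative to the directory of the page (the case from the question).
        System.out.println(resolve("http://www.nakkheeran.in/Users/frmMagazine.aspx?M=2",
                "frmArticles.aspx?A=25744"));
        // prints http://www.nakkheeran.in/Users/frmArticles.aspx?A=25744

        // Link relative to the site root.
        System.out.println(resolve("https://www.url.com/",
                "/ajax/libs/jquery/2.1.1/jquery.min.js"));
        // prints https://www.url.com/ajax/libs/jquery/2.1.1/jquery.min.js
    }
}
```

Passing the page URL instead of the host is what keeps the /Users/ segment that plain host-prefixing drops.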
    

Upvotes: 2

llogiq

Reputation: 14511

The URL class has a constructor URL(URL context, String url) that does what you tried to do by hand with string replacement.

Edit: In your case the context URL is the source URL of the parsed resource. Let's say you parse something from URL context = new URL("http://example.com/path/to/some.html#where?is+carmen+sandiego"). Then you just take the reference of any link (its src or href value) and create a URL ref = new URL(context, src).
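As a rough illustration of how one context URL handles the different kinds of reference a page may contain, here is a small sketch. The class name ContextUrlSketch, the helper resolveAll, and the sample references are assumptions made up for this example, and the context is a simplified version of the one above without the fragment.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ContextUrlSketch {

    // Hypothetical helper: resolves every reference against the same context URL.
    static List<String> resolveAll(String contextUrl, List<String> refs) {
        try {
            URL context = new URL(contextUrl);
            List<String> resolved = new ArrayList<>();
            for (String ref : refs) {
                resolved.add(new URL(context, ref).toString());
            }
            return resolved;
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        List<String> refs = Arrays.asList(
                "other.html",                       // relative to the directory of the page
                "../style.css",                     // one directory up
                "/js/app.js",                       // relative to the site root
                "https://cdn.example.org/lib.js");  // already absolute, left as it is
        for (String url : resolveAll("http://example.com/path/to/some.html", refs)) {
            System.out.println(url);
        }
        // prints:
        // http://example.com/path/to/other.html
        // http://example.com/path/style.css
        // http://example.com/js/app.js
        // https://cdn.example.org/lib.js
    }
}
```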

Upvotes: 1
