user1174105

Reputation: 23

Encoding a URL sent to a server (not in the query string)

I need to test my server against several URLs daily, since these URLs are updated by my users, and this will be done in Java. However, these URLs contain unusual characters (like the German umlauts). Basically, what I am doing is:

for (String theUrl : urlsToCheck) {
    URL u = new URL(theUrl);
    URLConnection connection = u.openConnection();
    // read the content and handle it
}

What I've found is that while org.apache.commons.codec.net.URLCodec is fine for encoding a string to paste into the query string, it is not suitable for encoding unusual characters in the rest of the URL into their hex (percent-encoded) counterparts. Here are some examples of URLs:

The desired result for the first would be:

Is there any library in Apache Commons, or in Java itself, that converts special characters in the ACTUAL URL (not the query string, and therefore not replacing the same set of characters)?

Thank you for your time.

Edit: Firefox translates "yr.no/place/Norway/Nordland/Moskenes/Å/data.html" into "yr.no/place/Norway/Nordland/Moskenes/%C3%85/data.html" (to try this, enter the first URL, press Enter, then copy the URL into a document). This is the effect I am looking for, since it is the actual translation. Most likely, Firefox either knows that Å is not allowed, tries multiple versions, or accepts the server's "Location" header; either way, "Å" is transformed into "%C3%85" in only a subset of the URL. This is the function we need.
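To make that transformation concrete: "%C3%85" is simply the two UTF-8 bytes of "Å" (U+00C5), each written as a percent sign followed by two hex digits. Below is a minimal sketch of that step, assuming UTF-8 is the encoding the server expects. The class and method names (PathEscapeSketch, escapeNonAscii) are made up for illustration and are not part of any library; the sketch only touches non-ASCII characters and does not escape spaces or other illegal ASCII characters.

```java
import java.nio.charset.StandardCharsets;

public class PathEscapeSketch {

    // Hypothetical helper: replaces every non-ASCII character with the
    // percent-encoded form of its UTF-8 bytes and leaves ASCII untouched.
    public static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                // Plain ASCII (including '/', ':' and '.') is copied as-is.
                out.append((char) cp);
            } else {
                // Encode the code point as UTF-8, then write each byte as %XX.
                byte[] bytes = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    out.append(String.format("%%%02X", b & 0xFF));
                }
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00C5 is "Å"; the escape avoids depending on the source file encoding.
        System.out.println(
                escapeNonAscii("yr.no/place/Norway/Nordland/Moskenes/\u00C5/data.html"));
    }
}
```

Because the separators of a URL are all ASCII, applying this to the whole string changes only the "Å" segment, which matches what the browser does in the example above.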

Edit: I just verified that the code given by the commenter sadly does not work. As an example, try this:

try {
    String urlStr = "http://www.yr.no/place/Norway/Nordland/Moskenes/Å/data.html";
    URL u = new URL(urlStr);
    URI uri = new URI(u.getProtocol(),
            u.getUserInfo(), u.getHost(), u.getPort(),
            u.getPath(), u.getQuery(),
            null); // removing ref

    URL urlObj = uri.toURL();
    HttpURLConnection connection = (HttpURLConnection) urlObj.openConnection();
    connection.setInstanceFollowRedirects(false);
    connection.connect();

    for (int i = 0; i < connection.getHeaderFields().size(); i++)
        System.out.println(connection.getHeaderFieldKey(i) + ": " + connection.getHeaderField(i));
    System.exit(0);
} catch (Exception e) {
    e.printStackTrace();
}

This yields a 404 error. Strangely enough, the encoded form does not work either.

Upvotes: 2

Views: 286

Answers (1)

Dev

Reputation: 12196

If you need a URL that is a valid URI (RFC 2396 compliant), you can create one like this in Java:

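    String urlString = "http://www.example.com/u/håkon-hellström/";

    URL url = new URL(urlString);
    // The multi-argument URI constructor quotes illegal characters in each component
    URI uri = new URI(url.getProtocol(), url.getAuthority(), url.getPath(), url.getQuery(), url.getRef());
    // toASCIIString() additionally percent-encodes non-ASCII characters as UTF-8
    url = new URL(uri.toASCIIString());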

That being said, all three sample strings you provided are RFC 2396 compliant and do not need to be encoded. I am assuming the spaces in the authority part of the URLs you provided are typos.

EDIT:

I updated the code block above. By using URI.toASCIIString() you can limit the resulting URI to only US-ASCII characters (other characters are encoded). The resulting string can then be used to create a new, valid URL.

http://www.example.com/u/håkon-hellström/

changes to

http://www.example.com/u/h%C3%A5kon-hellstr%C3%B6m/
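For anyone who wants to run this directly, here is the same approach wrapped in a small self-contained class. It is only a sketch: the class and method names (AsciiUrlSketch, toAsciiUrl) are made up for illustration, and it assumes the input has not been percent-encoded already.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class AsciiUrlSketch {

    // Hypothetical helper: splits the input with URL, rebuilds it with the
    // five-argument URI constructor (which quotes illegal characters), and
    // returns the US-ASCII form with non-ASCII characters percent-encoded.
    public static String toAsciiUrl(String urlString)
            throws MalformedURLException, URISyntaxException {
        URL url = new URL(urlString);
        URI uri = new URI(url.getProtocol(), url.getAuthority(),
                url.getPath(), url.getQuery(), url.getRef());
        return uri.toASCIIString();
    }

    public static void main(String[] args) throws Exception {
        // \u00E5 is "å" and \u00F6 is "ö"; the escapes avoid depending on
        // the source file encoding.
        System.out.println(
                toAsciiUrl("http://www.example.com/u/h\u00E5kon-hellstr\u00F6m/"));
    }
}
```

One caveat: the multi-argument URI constructors always quote the percent character, so a URL that already contains escapes such as %20 would come out double-encoded (%2520). Only feed this raw, unencoded strings.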

Upvotes: 1
