Frank

Reputation: 31086

How to parse a URI like this in Java

I'm trying to parse the following URI: http://translate.google.com/#zh-CN|en|你

but I get this error message:

java.net.URISyntaxException: Illegal character in fragment at index 34: http://translate.google.com/#zh-CN|en|你
        at java.net.URI$Parser.fail(URI.java:2809)
        at java.net.URI$Parser.checkChars(URI.java:2982)
        at java.net.URI$Parser.parse(URI.java:3028)

It has a problem with the "|" character. If I remove the "|", the trailing Chinese character causes no problem. What's the right way to handle this?

My method looks like this:

  public static void displayFileOrUrlInBrowser(String File_Or_Url)
  {
    try { Desktop.getDesktop().browse(new URI(File_Or_Url.replace(" ","%20").replace("^","%5E"))); }
    catch (Exception e) { e.printStackTrace(); }
  }

Thanks for the answers, but BalusC's solution seems to work only for one particular URL. My method needs to work with any URL I pass to it. How would it know where to split the URL into two parts so that only the second part gets encoded?

Upvotes: 10

Views: 44733

Answers (7)

vaquar khan

Reputation: 11449

First encode your URL as in the following example, then pass the encoded URL to the method:

        JSONObject json = new JSONObject();
        json.put("name", "vaquar");
        json.put("age", "30");
        json.put("address", "asasbsa bajsb ");


        System.out.println("in sslRestClientGETRankColl"+json.toString());

        String createdJson=json.toString();

        createdJson= URLEncoder.encode(createdJson, "UTF-8");

//call method now displayFileOrUrlInBrowser(createdJson);

public static void displayFileOrUrlInBrowser(String File_Or_Url)
  {
    try { Desktop.getDesktop().browse(new URI(File_Or_Url)); } // browse() takes a java.net.URI, not a String
    catch (Exception e) { e.printStackTrace(); }
  }

Upvotes: 0

Gili

Reputation: 89993

Taking the best of Federico's answer and Marek's answer, you need to do the following:

URL url = new URL(pageURLAsUnescapedString);

// URI's constructor expects the path, query string and fragment to be decoded.
// If we do not decode them, we will end up with double-encoding.
String path = url.getPath();
if (path != null)
  path = URLDecoder.decode(path, "UTF-8");
String query = url.getQuery();
if (query != null)
  query = URLDecoder.decode(query, "UTF-8");
String fragment = url.getRef();
if (fragment != null)
  fragment = URLDecoder.decode(fragment, "UTF-8");

URI uri = new URI(url.getProtocol(), url.getAuthority(), path, query, fragment);
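
To illustrate the double-encoding concern mentioned in the comments above, here is a minimal sketch. The class name `DoubleEncodingDemo`, the helper `build`, and the host `example.com` are made up for the illustration. It relies on the documented behaviour that the multi-argument URI constructors always quote the percent character.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class DoubleEncodingDemo {
    // Hypothetical helper: builds a URI from a path using the five-argument constructor.
    static String build(String path) {
        try {
            return new URI("http", "example.com", path, null, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Already-escaped input: the '%' itself gets quoted again.
        System.out.println(build("/a%20b")); // http://example.com/a%2520b
        // Decoded input: the space is quoted exactly once.
        System.out.println(build("/a b"));   // http://example.com/a%20b
    }
}
```

One caveat worth keeping in mind: URLDecoder.decode also turns a literal "+" into a space, so the decoding step may need care for paths that contain a plus sign.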

Upvotes: 3

Federico Pugnali

Reputation: 655

The URLEncoder solution didn't work for me, probably because it encodes everything, including delimiters such as ":" and "/". I was trying to use Apache's HttpGet, and it throws an error when given a URL string encoded that way.

The correct way in my case was this strange code:

URL url = new URL(pageURLAsUnescapedString);
URI uri = new URI(url.getProtocol(), url.getAuthority(), url.getPath(), url.getQuery(), url.getRef());

Somehow url.toURI() does not work the same way. The URI constructors behave in two ways: the one taking a single String assumes the given URI is already correctly escaped (hence the error; the same happens with the String constructor of HttpGet), whereas the multi-argument constructors accept unescaped components and quote illegal characters for you (and HttpGet has another constructor that accepts a URI). Why doesn't URL.toURI() do this? I have no clue...
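
As a rough sketch of how this could be wrapped into a method that accepts any URL string (the situation in the question's follow-up), something like the following might work. The class name `UrlToUriDemo` and the helper `toUri` are hypothetical, and the Chinese character is written as `\u4F60` only to keep the source file ASCII. Note that the `URL(String)` constructor is deprecated as of Java 20, so this reflects the API as used in this thread rather than a recommendation for new code.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class UrlToUriDemo {
    // Hypothetical helper: lets URL split the string, then lets the
    // multi-argument URI constructor quote illegal characters per component.
    static URI toUri(String raw) {
        try {
            URL url = new URL(raw);
            return new URI(url.getProtocol(), url.getAuthority(),
                           url.getPath(), url.getQuery(), url.getRef());
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot convert to URI: " + raw, e);
        }
    }

    public static void main(String[] args) {
        URI uri = toUri("http://translate.google.com/#zh-CN|en|\u4F60");
        // toASCIIString() additionally percent-encodes the non-ASCII character.
        System.out.println(uri.toASCIIString());
        // http://translate.google.com/#zh-CN%7Cen%7C%E4%BD%A0
    }
}
```

The caller never has to decide where to cut the string, because URL does the splitting into protocol, authority, path, query and fragment.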

Hope it helps someone, it took me some hours to figure it out.

Upvotes: 14

Spike Williams

Reputation: 37295

The pipe character is "considered unsafe" for use in URLs. You can fix it by replacing the | with its encoded hex equivalent, which would be "%7C".

However, replacing individual characters in a URL is a brittle solution that does not work very well when you consider that, in any given URL, there could potentially be quite a number of different characters that may need to be replaced. You are already replacing spaces, carets, and pipes... but what about brackets, and accent marks, and quotation marks? Or question marks and ampersands, which may or may not be valid parts of a URL, depending on how they are used?

Thus, a superior solution would be to use the language's facility for encoding URLs, rather than doing it manually. In the case of Java, use URLEncoder, as per the example in BalusC's answer to this question.

Upvotes: 16

BalusC

Reputation: 1108557

You should use java.net.URLEncoder to URL-encode the query with UTF-8. You don't necessarily need regex for this. You don't want to have a regex to cover all of those thousands of Chinese glyphs, do you? ;)

String query = URLEncoder.encode("zh-CN|en|你", "UTF-8");
String url = "http://translate.google.com/#" + query;
Desktop.getDesktop().browse(new URI(url));    
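
For reference, a small sketch of what URLEncoder produces here, and why it is applied only to the part after the "#" rather than to the whole URL. The class name `EncoderScopeDemo` and the helper `encode` are made up for this illustration, and the Chinese character is written as `\u4F60` to keep the source ASCII.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EncoderScopeDemo {
    // Hypothetical helper: UTF-8 URL-encoding without a checked exception.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        // Encoding just the component escapes the pipe and the Chinese character.
        System.out.println(encode("zh-CN|en|\u4F60"));
        // zh-CN%7Cen%7C%E4%BD%A0

        // Encoding the whole URL also escapes ':' and '/', so it is no longer a usable URL.
        System.out.println(encode("http://translate.google.com/"));
        // http%3A%2F%2Ftranslate.google.com%2F
    }
}
```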

Upvotes: 7

Geo

Reputation: 96767

Aren't you better off using URLEncoder than selectively encoding stuff?

Upvotes: 7

Frank

Reputation: 31086

Alright, I found out how to do it, like this:

try { Desktop.getDesktop().browse(new URI(File_Or_Url.replace(" ","%20").replace("^","%5E").replace("|","%7C"))); }
catch (Exception e) { e.printStackTrace(); }

Upvotes: -1
