davissandefur

Reputation: 161

UTF-8 for URL, Java

So I'm trying to scrape a grammar website that gives you conjugations of verbs, but I'm having trouble accessing the pages that require accents, such as the page for the verb "fág".

Here is my current code:

    String url = "http://www.teanglann.ie/en/gram/"+ URLEncoder.encode("fág","UTF-8");
    System.out.println(url);

I've tried this both with and without the URLEncoder.encode() method, and it just keeps giving me a '?' in place of the 'á' when working with it, and my URL search returns nothing. Basically, I was wondering if there was something similar to Python's 'urllib.parse.quote_plus'. I've tried searching and tried many different methods from StackOverflow, all to no avail. Any help would be greatly appreciated.

Eventually, I'm going to replace the given string with a user-supplied argument. I'm just using it to test at the moment.

Solution: It wasn't Java, but IntelliJ.

Upvotes: 2

Views: 1919

Answers (1)

Jayan

Reputation: 18459

Summary from comment

The test code works fine.

    import java.io.UnsupportedEncodingException;
    import static java.net.URLEncoder.encode;

    public class MainApp {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String url = "http://www.teanglann.ie/en/gram/" + encode("fág", "UTF-8");
            System.out.println(url);
        }
    }

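When the source file is read as UTF-8, it prints

http://www.teanglann.ie/en/gram/f%C3%A1g

which goes to the correct page. Output such as f%EF%BF%BDg instead means the 'á' had already been replaced by U+FFFD (the replacement character) before it was encoded, which points to a source-encoding mismatch.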

The correct steps are:

  • Ensure that the source file encoding is set correctly. (IntelliJ may not detect it correctly on its own.)
  • Compile and run the program with the matching encoding (UTF-8 in this case).

(See What is the default encoding of the JVM? for a relevant discussion)
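To check whether the source encoding is the culprit, one option is to write the accented character as a Unicode escape, which is plain ASCII and so does not depend on how the file is decoded. The following is a minimal sketch of that check; the class name EncodingCheck and the helper encodeUtf8 are illustrative names, not part of the original code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class EncodingCheck {
    // Form-style percent-encoding of a string as UTF-8 (hypothetical helper).
    static String encodeUtf8(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every JVM, so this should not occur.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // U+00E1 (a with acute) written as an ASCII-only escape.
        String verb = "f\u00e1g";
        System.out.println(encodeUtf8(verb));        // f%C3%A1g

        // U+FFFD is what a mis-decoded character typically becomes.
        System.out.println(encodeUtf8("f\uFFFDg"));  // f%EF%BF%BDg
    }
}
```

If the escaped version prints f%C3%A1g while the literal "fág" does not, the source file is probably being read with the wrong charset.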

Edit from Wyzard's comment

The code above works only by accident (for example, because the input contains no whitespace, which URLEncoder would turn into '+'). The correct way to get an encoded URL is shown below.

    String url = "http://www.teanglann.ie/en/gram/fág";
    System.out.println(new URI(url).toASCIIString());

This uses URI.toASCIIString(), which follows RFC 2396 (Uniform Resource Identifiers (URI): Generic Syntax).
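For user-supplied input, note that the single-argument URI constructor rejects strings containing spaces. A possible sketch, assuming the verb is always a single path segment under /en/gram/, is to use the multi-argument constructor, which quotes illegal characters in the path. The class name VerbUrl, the method forVerb, and the sample input "bain as" are made up for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

class VerbUrl {
    // Builds the page URL for a verb (hypothetical helper). The
    // multi-argument URI constructor quotes illegal path characters
    // such as spaces; toASCIIString() then percent-encodes non-ASCII
    // characters as UTF-8.
    static String forVerb(String verb) {
        try {
            URI uri = new URI("http", "www.teanglann.ie", "/en/gram/" + verb, null);
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build URL for: " + verb, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(forVerb("f\u00e1g"));  // http://www.teanglann.ie/en/gram/f%C3%A1g
        System.out.println(forVerb("bain as"));   // http://www.teanglann.ie/en/gram/bain%20as
    }
}
```

Unlike URLEncoder, this encodes a space as %20 rather than '+', which is what a URL path expects.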

Upvotes: 1
