Reputation: 161
So I'm trying to scrape a grammar website that gives you conjugations of verbs, but I'm having trouble accessing the pages that require accents, such as the page for the verb "fág".
Here is my current code:
String url = "http://www.teanglann.ie/en/gram/"+ URLEncoder.encode("fág","UTF-8");
System.out.println(url);
I've tried this both with and without the URLEncoder.encode() method, and it just keeps giving me a '?' in place of the 'á' when working with it, and my URL search returns nothing. Basically, I was wondering if there was something similar to Python's 'urllib.parse.quote_plus'. I've tried searching and tried many different methods from StackOverflow, all to no avail. Any help would be greatly appreciated.
Eventually, I'm going to replace the hard-coded string with a user-supplied argument. I'm just using it for testing at the moment.
Solution: It wasn't Java, but IntelliJ.
Upvotes: 2
Views: 1919
Reputation: 18459
Summary from comment
The test code works fine.
import java.io.UnsupportedEncodingException;
import static java.net.URLEncoder.encode;

public class MainApp {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Percent-encode the verb as UTF-8 before appending it to the base URL
        String url = "http://www.teanglann.ie/en/gram/" + encode("fág", "UTF-8");
        System.out.println(url);
    }
}
It prints the following:
http://www.teanglann.ie/en/gram/f%C3%A1g
which leads to the correct page.
Correct steps are
(See What is the default encoding of the JVM? for a relevant discussion)
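If the '?' comes from the source-file or console encoding rather than from URLEncoder, one way to check is to take the source encoding out of the picture. The sketch below is only an illustration: the class name EncodingCheck and the helper gramUrl are made up for this example. It writes 'á' as the Unicode escape \u00e1, so the compiled string is the same whatever encoding the IDE saves the file in, and it prints the JVM default charset for comparison.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.Charset;

public class EncodingCheck {
    // The Unicode escape for 'á' keeps this literal independent of the source-file encoding.
    static final String VERB = "f\u00e1g";

    // Hypothetical helper: appends the UTF-8 percent-encoded verb to the base URL.
    static String gramUrl(String verb) {
        try {
            return "http://www.teanglann.ie/en/gram/" + URLEncoder.encode(verb, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Shows which charset the JVM falls back to when none is given explicitly.
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println(gramUrl(VERB)); // http://www.teanglann.ie/en/gram/f%C3%A1g
    }
}
```

If this prints f%C3%A1g while the version with a literal "fág" prints f%3Fg or similar, the file is probably being saved or compiled in a different encoding than expected.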
Edit from Wyzard's comment
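The code above works only by accident (for example, because the verb contains no whitespace). URLEncoder is meant for HTML form data and encodes a space as '+', which is not what a URL path needs. The correct way to get an encoded URL is shown below: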
String url = "http://www.teanglann.ie/en/gram/fág";
System.out.println(new URI(url).toASCIIString());
This uses URI.toASCIIString(), which follows RFC 2396, "Uniform Resource Identifiers (URI): Generic Syntax".
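One caveat: the single-argument URI constructor rejects a raw space with a URISyntaxException, so for user input that may contain whitespace the multi-argument constructor is probably the safer route, since it quotes illegal characters in each component. The following is a minimal sketch, not a definitive implementation; the class name PathEncodingSketch, the helper gramUrl, and the sample input "fág amach" are assumptions made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class PathEncodingSketch {
    // Hypothetical helper: builds the conjugation URL from a raw, unencoded verb.
    static String gramUrl(String verb) {
        try {
            // The (scheme, host, path, fragment) constructor quotes illegal characters such as spaces.
            URI uri = new URI("http", "www.teanglann.ie", "/en/gram/" + verb, null);
            // toASCIIString() then percent-encodes the remaining non-ASCII characters as UTF-8.
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build a URL for: " + verb, e);
        }
    }

    public static void main(String[] args) {
        // \u00e1 is 'á', written as an escape so the source-file encoding does not matter.
        System.out.println(gramUrl("f\u00e1g"));       // http://www.teanglann.ie/en/gram/f%C3%A1g
        System.out.println(gramUrl("f\u00e1g amach")); // http://www.teanglann.ie/en/gram/f%C3%A1g%20amach
    }
}
```

Note that the space becomes %20 here, whereas URLEncoder.encode would turn it into '+'.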
Upvotes: 1