Reputation: 3325
I would like to validate a long list of URL strings, but some of them contain umlauts and other accented characters, e.g. ä, à, è, ö.
Is there a way to configure the Apache Commons UrlValidator to accept these characters?
This test fails (notice the ã):
@Test
public void urlValidatorShouldPassWithUmlaut()
{
    // Given
    org.apache.commons.validator.routines.UrlValidator validator;
    validator = new UrlValidator( new String[] { "http", "https" }, UrlValidator.ALLOW_ALL_SCHEMES );
    // When
    String url = "http://dbpedia.org/resource/São_Paulo";
    // Then
    assertThat( validator.isValid( url ), is( true ) );
}
This test passes (ã replaced with a):
@Test
public void urlValidatorShouldPassWithUmlaut()
{
    // Given
    org.apache.commons.validator.routines.UrlValidator validator;
    validator = new UrlValidator( new String[] { "http", "https" }, UrlValidator.ALLOW_ALL_SCHEMES );
    // When
    String url = "http://dbpedia.org/resource/Sao_Paulo";
    // Then
    assertThat( validator.isValid( url ), is( true ) );
}
Software version:
<dependency>
<groupId>commons-validator</groupId>
<artifactId>commons-validator</artifactId>
<version>1.4.0</version>
</dependency>
Update:
validator.isValid( IDN.toASCII(url) ) also fails, because IDN.toASCII(url) does things that I don't yet understand: it converts http://dbpedia.org/resource/São_Paulo into http://dbpedia.xn--org/resource/so_paulo-w1b, which is still invalid according to UrlValidator.
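The odd result comes from IDN.toASCII being meant for host names rather than whole URLs: it splits its input on dots and Punycode-encodes every label that contains non-ASCII characters, so "org/resource/São_Paulo" is treated as one label. Below is a minimal sketch that applies IDN.toASCII to the host only and lets java.net.URI percent-encode the path as UTF-8. It assumes a URL without port, user info, query or fragment; the names AsciiUrlSketch and toAsciiUrl are made up for illustration, and whether UrlValidator 1.4.0 accepts the output has not been checked here.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;

class AsciiUrlSketch {

    // Hypothetical helper: returns an ASCII-only form of a simple scheme://host/path URL.
    static String toAsciiUrl(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            throw new IllegalArgumentException("No scheme in: " + url);
        }
        String scheme = url.substring(0, schemeEnd);
        String rest = url.substring(schemeEnd + 3);

        // The host ends at the first "/"; the path keeps its leading "/"
        int slash = rest.indexOf('/');
        String host = slash < 0 ? rest : rest.substring(0, slash);
        String path = slash < 0 ? "" : rest.substring(slash);

        try {
            // Punycode applies to the host only; the multi-argument URI constructor
            // quotes illegal characters and toASCIIString() percent-encodes
            // non-ASCII characters as UTF-8
            return new URI(scheme, IDN.toASCII(host), path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot convert: " + url, e);
        }
    }

    public static void main(String[] args) {
        // "\u00e3" is the letter a with tilde, written as an escape to avoid source-encoding issues
        System.out.println(toAsciiUrl("http://dbpedia.org/resource/S\u00e3o_Paulo"));
        // prints http://dbpedia.org/resource/S%C3%A3o_Paulo
    }
}
```

The resulting string could then be passed to validator.isValid(...) in place of the raw URL.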
Upvotes: 3
Views: 4721
Reputation: 3325
While reading this SO question (Regex: what is InCombiningDiacriticalMarks?) I found that another partial solution is as follows:
public static boolean removeAccentsAndValidateUrl( String url )
{
    String normalizedUrl = Normalizer.normalize( url, Normalizer.Form.NFD );
    Pattern accentsPattern = Pattern.compile( "\\p{InCombiningDiacriticalMarks}+" );
    String urlWithoutAccents = accentsPattern.matcher( normalizedUrl ).replaceAll( "" );
    String[] schemes = {"http", "https"};
    long options = UrlValidator.ALLOW_ALL_SCHEMES;
    UrlValidator urlValidator = new UrlValidator( schemes, options );
    return urlValidator.isValid(urlWithoutAccents);
}
However, it turns out that UrlValidator also fails on (among others) "–" (en dash) characters, which this normalization does not remove.
For example, the following fails validation:
http://dbpedia.org/resource/PENTA_–_Pena_Transportes_Aereos
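To see why the en dash survives, here is a small sketch of just the normalization step. NFD splits "ã" into "a" plus a combining tilde, which the regex then removes, whereas the en dash has no decomposition and is not a combining mark. The class and method names (StripAccentsSketch, stripAccents) are only for illustration.

```java
import java.text.Normalizer;

class StripAccentsSketch {

    // Decompose accented letters (NFD), then drop the combining marks
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // The first literal contains the letter a with tilde, written as a Unicode escape
        System.out.println(stripAccents("S\u00e3o_Paulo")); // prints Sao_Paulo

        // The second literal contains an en dash, which passes through untouched
        String withDash = "PENTA_\u2013_Pena";
        System.out.println(stripAccents(withDash).equals(withDash)); // prints true
    }
}
```

So stripping accents only helps with letters that decompose into a base letter plus a mark; other non-ASCII characters still need percent-encoding.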
Upvotes: 0
Reputation: 11205
You must percent-encode the part containing the umlaut before you validate it:
import org.apache.commons.validator.routines.UrlValidator;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UmlautUrlTest {
    public static void main(String[] args) {
        String url = "http://dbpedia.org/resource/";
        String umlautPart = "São_Paulo";
        try {
            String[] schemes = {"http", "https"};
            UrlValidator v = new UrlValidator(schemes, UrlValidator.ALLOW_ALL_SCHEMES);
            // Percent-encode only the last path segment, as UTF-8
            String encodedPart = URLEncoder.encode(umlautPart, "UTF-8");
            System.out.println(v.isValid(url + encodedPart));
            System.out.println(encodedPart);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }
}
The output is:
true
S%C3%A3o_Paulo
EDIT:
You can use this function to encode the whole URL before validating it.
public static String encodeUrl(String url) {
    // Split off the scheme; the limit of 2 leaves any later "://" untouched
    String[] parts = url.split("://", 2);
    String protocol = parts[0];
    String restOfUrl = parts[1];
    // The host ends at the first "/"; it is kept as is, dots included
    int slash = restOfUrl.indexOf('/');
    if (slash < 0) {
        return url; // no path, so nothing to encode
    }
    String host = restOfUrl.substring(0, slash);
    StringBuilder encoded = new StringBuilder(protocol + "://" + host);
    try {
        // Encode each path segment on its own so the "/" separators survive
        for (String segment : restOfUrl.substring(slash + 1).split("/", -1)) {
            encoded.append('/').append(URLEncoder.encode(segment, "UTF-8"));
        }
    } catch (UnsupportedEncodingException e) {
        // UTF-8 is always supported, so this should never happen
        throw new IllegalStateException(e);
    }
    return encoded.toString();
}
And use in your test: validator.isValid(encodeUrl(url))
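One caveat with URLEncoder: it implements HTML form encoding (application/x-www-form-urlencoded), so a space becomes "+" rather than "%20". A small sketch of a wrapper that corrects this for path segments; the name encodePathSegment is hypothetical, and the snippet assumes each segment is encoded separately.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class PathSegmentSketch {

    // Hypothetical helper: percent-encodes one path segment as UTF-8
    static String encodePathSegment(String segment) {
        try {
            // A literal "+" in the input is already encoded as %2B at this point,
            // so every remaining "+" stands for a space
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always supported
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The literal contains the letter a with tilde, written as a Unicode escape
        System.out.println(encodePathSegment("S\u00e3o Paulo")); // prints S%C3%A3o%20Paulo
    }
}
```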
Upvotes: 1