Reputation: 31
I'd like to sort strings in Japanese (which may contain the various Japanese characters as well as Latin characters), and the Latin characters should be sorted to the end.
final Collator collator = Collator.getInstance(Locale.JAPANESE);
List<String> objcts = new ArrayList<>();
objcts.add("Alpha");
objcts.add("家事問屋");
Collections.sort(objcts, collator);
System.out.println(objcts);
Out: [Alpha, 家事問屋]
Desired Out: [家事問屋, Alpha]
Is there a simple, known way to achieve this?
Upvotes: 2
Views: 616
Reputation: 9377
You could probably implement a Comparator (or extend Collator) that ranks CJK before Latin using a regex like this:
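import java.text.Collator;
import java.util.Comparator;

// Note: despite its name, this comparator ranks pure-CJK strings before pure-Latin ones.
public class LatinBeforeCJKCollator implements Comparator<String> {
    private final Collator collator;

    public LatinBeforeCJKCollator(Collator collator) {
        this.collator = collator;
    }

    @Override
    public int compare(String source, String target) {
        // Pure CJK (Hiragana, Katakana, Han) sorts before pure Latin.
        if (source.matches("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]+") && target.matches("\\p{IsLatin}+")) {
            return -1;
        }
        // Pure Latin sorts after pure CJK.
        if (source.matches("\\p{IsLatin}+") && target.matches("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]+")) {
            return 1;
        }
        // Same script on both sides, or mixed content: defer to the wrapped collator.
        return collator.compare(source, target);
    }
}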
I used the Unicode script classes from an answer to this question: How can I detect japanese text in a Java string?
You might need to customize the matching (e.g. all letters are Latin, first letter is Latin, etc.) to suit your needs.
When used like this:
final Comparator<String> comparator = new LatinBeforeCJKCollator(Collator.getInstance(Locale.JAPANESE));
List<String> strings = List.of("Alpha", "Beta", "問屋", "家事問屋");
System.out.println(strings.stream().sorted(comparator).collect(Collectors.joining(",")));
Then the output would appear sorted like this:
家事問屋,問屋,Alpha,Beta
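As one possible variation on the matching mentioned above, here is a minimal sketch that only looks at the first character instead of the whole string. The class name `FirstLetterScriptOrder` and both helper methods are made up for illustration, and it assumes that "starts with a Latin-script character" is a good enough test for your data.

```java
import java.text.Collator;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

class FirstLetterScriptOrder {

    // Hypothetical helper: true when the first code point belongs to the Latin script.
    static boolean startsWithLatin(String s) {
        return !s.isEmpty()
                && Character.UnicodeScript.of(s.codePointAt(0)) == Character.UnicodeScript.LATIN;
    }

    // Strings that do not start with Latin come first (false sorts before true);
    // within each group the given collator decides the order.
    static Comparator<String> latinLast(Collator collator) {
        return Comparator.comparing(FirstLetterScriptOrder::startsWithLatin)
                .thenComparing(collator);
    }

    public static void main(String[] args) {
        List<String> strings = List.of("Alpha", "Beta", "問屋", "家事問屋");
        System.out.println(strings.stream()
                .sorted(latinLast(Collator.getInstance(Locale.JAPANESE)))
                .collect(Collectors.joining(",")));
    }
}
```

Unlike the regex version, this also gives mixed strings a definite group, based on whatever their first character is.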
Upvotes: 3
Reputation: 377
I don't code much in Java, but I can explain the steps you can take.
As far as I know, Java does not provide an alphabet string, so you can create a string variable that contains the alphabet (both upper- and lower-case). Let's call it alphabet. The string would look like this: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
Then you'll need a variable containing the last index of the list (its size minus one). We will call it last.
Assuming each item is either fully Japanese or fully Latin and assuming that your list is already full, you can loop through the list and perform these steps on each item:
- Get the first character in the string.
- Test whether it is in alphabet.
- If true, move the item to index last (the end of the list). If false, leave it where it is.
That's basically it! I sincerely apologise for not being able to provide the code, as I code mostly in Python, but I hope this helped!
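The steps above could be sketched in Java roughly as follows. The class and method names are invented for this example, and it assumes a mutable list such as an ArrayList. Note that it only moves Latin items behind the others and keeps the existing order inside each group; it does not sort them.

```java
import java.util.ArrayList;
import java.util.List;

class MoveLatinToEnd {
    // The alphabet string described in the steps above.
    static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Moves every item whose first character is in ALPHABET to the end of the list, in place.
    static void moveLatinToEnd(List<String> items) {
        int remaining = items.size(); // number of original items still to be checked
        int i = 0;
        while (remaining-- > 0) {
            String item = items.get(i);
            if (!item.isEmpty() && ALPHABET.indexOf(item.charAt(0)) >= 0) {
                // Remove it and append it at the last position; the next item slides into index i.
                items.add(items.remove(i));
            } else {
                i++;
            }
        }
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>(List.of("Alpha", "家事問屋", "Beta", "問屋"));
        moveLatinToEnd(items);
        System.out.println(items); // prints [家事問屋, 問屋, Alpha, Beta]
    }
}
```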
Upvotes: 0
Reputation: 1150
Does the order of the Japanese and English strings matter? If yes, you need to implement your own comparison method for the collator.
If the order does not matter, you can just do:
Collections.sort(objcts, Collections.reverseOrder());
To add a bit more to this: a collator is usually used for a single language, so you need a way to tell the characters of the two writing systems apart. I would strongly suggest using two separate lists for English and Japanese text: detect which language each word is in and put it in the corresponding list. Then you can sort both lists accordingly and combine or use them as you wish.
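A minimal sketch of the two-list idea might look like this. The names are hypothetical, and the language detection is a deliberate simplification: a word is treated as Japanese if it contains at least one Hiragana, Katakana, or Han character, and as English otherwise.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TwoListSort {

    // Assumption: any Hiragana, Katakana, or Han character marks the word as Japanese.
    static boolean isJapanese(String s) {
        return s.codePoints().anyMatch(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            return script == Character.UnicodeScript.HIRAGANA
                    || script == Character.UnicodeScript.KATAKANA
                    || script == Character.UnicodeScript.HAN;
        });
    }

    // Splits the words into two lists, sorts each with its own collator,
    // and returns the Japanese words followed by the English ones.
    static List<String> japaneseThenEnglish(List<String> words) {
        List<String> japanese = new ArrayList<>();
        List<String> english = new ArrayList<>();
        for (String word : words) {
            (isJapanese(word) ? japanese : english).add(word);
        }
        japanese.sort(Collator.getInstance(Locale.JAPANESE));
        english.sort(Collator.getInstance(Locale.ENGLISH));
        japanese.addAll(english);
        return japanese;
    }

    public static void main(String[] args) {
        System.out.println(japaneseThenEnglish(List.of("Beta", "問屋", "Alpha")));
        // prints [問屋, Alpha, Beta]
    }
}
```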
Upvotes: 0
Reputation: 9865
I guess the letters are in Unicode.
For the range of Latin letters, this Wikipedia article says:
As of version 13.0 of the Unicode Standard, 1,374 characters in the following blocks are classified as belonging to the Latin script:
- Basic Latin, 0000–007F. This block corresponds to ASCII.
- Latin-1 Supplement, 0080–00FF
- Latin Extended-A, 0100–017F
- Latin Extended-B, 0180–024F
- IPA Extensions, 0250–02AF
- Spacing Modifier Letters, 02B0–02FF
- Phonetic Extensions, 1D00–1D7F
- Phonetic Extensions Supplement, 1D80–1DBF
- Latin Extended Additional, 1E00–1EFF
- Superscripts and Subscripts, 2070–209F
- Letterlike Symbols, 2100–214F
- Number Forms, 2150–218F
- Latin Extended-C, 2C60–2C7F
- Latin Extended-D, A720–A7FF
- Latin Extended-E, AB30–AB6F
- Alphabetic Presentation Forms (Latin ligatures), FB00–FB4F
- Halfwidth and Fullwidth Forms, FF00–FFEF
So most of them come before the Japanese characters. Using these ranges, you could make sure that Japanese letters are placed in front.
And the ranges for Japanese are listed here, according to this post.
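As a rough sketch of how the block ranges quoted above could be used, the comparator below checks whether the first code point falls into one of the listed ranges and sorts such strings last. The class and method names are made up for this example. Two assumptions are worth stating: the Halfwidth and Fullwidth Forms block is left out because it also contains halfwidth katakana, and the listed blocks include digits and punctuation, so this tests for "Latin block" rather than "Latin letter".

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class LatinBlockOrder {

    // Inclusive code point ranges taken from the list above.
    // FF00-FFEF is omitted on purpose: it also holds halfwidth katakana.
    static final int[][] LATIN_BLOCKS = {
        {0x0000, 0x02FF}, // Basic Latin through Spacing Modifier Letters (contiguous)
        {0x1D00, 0x1DBF}, // Phonetic Extensions and Supplement
        {0x1E00, 0x1EFF}, // Latin Extended Additional
        {0x2070, 0x209F}, // Superscripts and Subscripts
        {0x2100, 0x218F}, // Letterlike Symbols, Number Forms
        {0x2C60, 0x2C7F}, // Latin Extended-C
        {0xA720, 0xA7FF}, // Latin Extended-D
        {0xAB30, 0xAB6F}, // Latin Extended-E
        {0xFB00, 0xFB4F}  // Alphabetic Presentation Forms
    };

    // True when the first code point of s lies in one of the ranges above.
    static boolean startsInLatinBlock(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int cp = s.codePointAt(0);
        for (int[] range : LATIN_BLOCKS) {
            if (cp >= range[0] && cp <= range[1]) {
                return true;
            }
        }
        return false;
    }

    // Strings outside the Latin blocks first, then Latin; natural String order within each group.
    static Comparator<String> latinLast() {
        return Comparator.comparing(LatinBlockOrder::startsInLatinBlock)
                .thenComparing(Comparator.<String>naturalOrder());
    }

    public static void main(String[] args) {
        System.out.println(List.of("Beta", "家事問屋", "Alpha").stream()
                .sorted(latinLast())
                .collect(Collectors.toList())); // prints [家事問屋, Alpha, Beta]
    }
}
```

Within each group this uses plain String ordering; a Collator could be passed to thenComparing instead if locale-aware ordering is needed.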
Upvotes: 0