Steffen Spranger

Reputation: 31

Sort latin characters to the end in Japanese sorting

I'd like to sort strings in Japanese (which may contain the various Japanese scripts as well as Latin characters), and the Latin characters should be sorted to the end.

final Collator collator = Collator.getInstance(Locale.JAPANESE);
List<String> objcts = new ArrayList<>();

objcts.add("Alpha");
objcts.add("家事問屋");

Collections.sort(objcts, collator);
System.out.println(objcts);

Out: [Alpha, 家事問屋]

Desired Out: [家事問屋, Alpha]

Is there a simple, known way to achieve this?

Upvotes: 2

Views: 616

Answers (4)

hc_dev

Reputation: 9377

You could probably implement a Comparator (or extend Collator) that uses regexes to rank CJK strings before Latin ones. Despite its name, the class below sorts Latin after CJK, which is the order asked for:

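import java.text.Collator;
import java.util.Comparator;

public class LatinBeforeCJKCollator implements Comparator<String> {

    // Delegate used whenever the script check below does not decide the order.
    private final Collator collator;

    public LatinBeforeCJKCollator(Collator collator) {
        this.collator = collator;
    }

    @Override
    public int compare(String source, String target) {
        // A purely Japanese string sorts before a purely Latin one.
        if (source.matches("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]+") && target.matches("\\p{IsLatin}+")) {
            return -1;
        }
        // A purely Latin string sorts after a purely Japanese one.
        if (source.matches("\\p{IsLatin}+") && target.matches("[\\p{IsHiragana}\\p{IsKatakana}\\p{IsHan}]+")) {
            return 1;
        }
        // Same script, or mixed content: fall back to the wrapped collator.
        return collator.compare(source, target);
    }

}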

I took the Unicode script classes from an answer to this question: How can I detect japanese text in a Java string?

You might need to adapt the matching to your needs (e.g. all letters are Latin, only the first letter is Latin, etc.).
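As one possible customization, here is a minimal sketch of the "first letter is Latin" variant. It uses Character.UnicodeScript instead of a regex; the class name ScriptAwareOrder and its methods are made up for illustration, and the collator is only assumed to break ties within each group.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the answer above.
final class ScriptAwareOrder {

    // True if the first code point of s belongs to the Latin script.
    static boolean startsWithLatin(String s) {
        return !s.isEmpty()
                && Character.UnicodeScript.of(s.codePointAt(0)) == Character.UnicodeScript.LATIN;
    }

    // Strings that do not start with a Latin letter come first (false < true);
    // within each group the supplied collator decides the order.
    static Comparator<String> latinLast(Collator collator) {
        return Comparator.comparing(ScriptAwareOrder::startsWithLatin)
                .thenComparing(collator);
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>(List.of("Alpha", "家事問屋", "Beta", "問屋"));
        items.sort(latinLast(Collator.getInstance(Locale.JAPANESE)));
        // The two Latin strings end up last, in the order Alpha, Beta.
        System.out.println(items);
    }
}
```

Because the check only looks at the first code point, a string such as "Aの" would be grouped with the Latin strings; whether that is desirable depends on the data.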

When used like this:

final Comparator<String> comparator = new LatinBeforeCJKCollator(Collator.getInstance(Locale.JAPANESE));
List<String> strings = List.of("Alpha", "Beta", "問屋", "家事問屋");

System.out.println(strings.stream().sorted(comparator).collect(Collectors.joining(",")));

Then the output would appear sorted like this:

家事問屋,問屋,Alpha,Beta

Upvotes: 3

Aya Noaman

Reputation: 377

I don't code much in Java, but I can explain the steps you can take.

As far as I know, there is no alphabet string provided in Java, so you can create a string variable that contains the alphabet (both upper- and lower-case). Let's call it alphabet. The string would look like this: "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

Then you'll have to make a variable containing the last index of the list (that is, the size of the list minus one). We will call it last.

Assuming each item is either fully Japanese or fully Latin and assuming that your list is already full, you can loop through the list and perform these steps on each item:

  1. Get the first character in the string.

  2. Test to see if it is in alphabet.

  3. If it is, move the item to index last (the end of the list). If it is not, leave it where it is.
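The steps above could be sketched in Java roughly as follows. This is only an illustration under the same assumption (each item is either fully Japanese or fully Latin); the class and method names are hypothetical, and instead of moving items within the list while iterating, the sketch collects them into two temporary lists, which keeps the original relative order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the steps; names are not from the answer.
final class LatinToEnd {

    static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Returns a new list in which items starting with an ASCII letter come last.
    static List<String> moveLatinToEnd(List<String> items) {
        List<String> front = new ArrayList<>();
        List<String> back = new ArrayList<>();
        for (String item : items) {
            // Steps 1 and 2: take the first character and test it against the alphabet.
            boolean latin = !item.isEmpty() && ALPHABET.indexOf(item.charAt(0)) >= 0;
            // Step 3: Latin items go to the back, everything else stays in front.
            (latin ? back : front).add(item);
        }
        front.addAll(back);
        return front;
    }
}
```

Note that this only partitions the list; it does not sort the items within each group.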

That's basically it! I sincerely apologise for not being able to provide the code, as I code mostly in Python, but I hope this helped!

Upvotes: 0

feedy

Reputation: 1150

Does the order of the Japanese and English strings matter? If yes, you need to implement your own comparison method for the collator.

If the order does not matter, you can just do:

Collections.sort(objcts, Collections.reverseOrder());

To add a bit more to this: a collator is usually used for a single language, so you need a way to tell the characters of the two alphabets apart. I would strongly suggest using two separate lists for English and Japanese text: detect which language each word is in and put it in the corresponding list. Then you can sort both lists accordingly and combine or use them as you wish.
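A minimal sketch of the two-list idea might look like this. The detection rule (a string counts as Japanese if it contains any Han, Hiragana or Katakana character) and the names TwoListSort, isJapanese and sortJapaneseFirst are assumptions made for the example, not something prescribed above.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; the detection rule is an assumption.
final class TwoListSort {

    // Assumption: any Han, Hiragana or Katakana code point marks the string as Japanese.
    static boolean isJapanese(String s) {
        return s.codePoints().anyMatch(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            return script == Character.UnicodeScript.HAN
                    || script == Character.UnicodeScript.HIRAGANA
                    || script == Character.UnicodeScript.KATAKANA;
        });
    }

    // Splits the input into two lists, sorts each with its own collator,
    // then appends the non-Japanese list after the Japanese one.
    static List<String> sortJapaneseFirst(List<String> input) {
        List<String> japanese = new ArrayList<>();
        List<String> other = new ArrayList<>();
        for (String s : input) {
            (isJapanese(s) ? japanese : other).add(s);
        }
        japanese.sort(Collator.getInstance(Locale.JAPANESE));
        other.sort(Collator.getInstance(Locale.ENGLISH));
        List<String> result = new ArrayList<>(japanese);
        result.addAll(other);
        return result;
    }
}
```

Keeping the lists separate also makes it easy to use a different collator per language, as done here with Locale.JAPANESE and Locale.ENGLISH.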

Upvotes: 0

Gwang-Jin Kim

Reputation: 9865

I assume the strings are in Unicode.

For the range of Latin letters, the Wikipedia article says:

As of version 13.0 of the Unicode Standard, 1,374 characters in the following blocks are classified as belonging to the Latin script:

  • Basic Latin, 0000–007F. This block corresponds to ASCII.
  • Latin-1 Supplement, 0080–00FF
  • Latin Extended-A, 0100–017F
  • Latin Extended-B, 0180–024F
  • IPA Extensions, 0250–02AF
  • Spacing Modifier Letters, 02B0–02FF
  • Phonetic Extensions, 1D00–1D7F
  • Phonetic Extensions Supplement, 1D80–1DBF
  • Latin Extended Additional, 1E00–1EFF
  • Superscripts and Subscripts, 2070–209F
  • Letterlike Symbols, 2100–214F
  • Number Forms, 2150–218F
  • Latin Extended-C, 2C60–2C7F
  • Latin Extended-D, A720–A7FF
  • Latin Extended-E, AB30–AB6F
  • Alphabetic Presentation Forms (Latin ligatures) FB00–FB4F
  • Halfwidth and Fullwidth Forms, FF00–FFEF

So most of them come before the Japanese ranges. Using these ranges, you could write a comparison that puts Japanese characters in front.

And the ranges for Japanese, according to this post, are:

  • Japanese-style punctuation (3000 - 303f)
  • Hiragana (3040 - 309f)
  • Katakana (30a0 - 30ff)
  • Full-width roman characters and half-width katakana (ff00 - ffef)
  • CJK unified ideographs - Common and uncommon kanji (4e00 - 9faf)
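As a rough sketch of how these ranges could drive a comparison: the code below treats a string as Japanese when its first code point falls into the punctuation, hiragana, katakana or kanji ranges listed above. The ff00 - ffef block is deliberately left out because it appears in both lists; that choice, and the names RangeBasedOrder, inJapaneseRange and JAPANESE_FIRST, are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; range choices follow the list above, minus ff00 - ffef.
final class RangeBasedOrder {

    static boolean inJapaneseRange(int cp) {
        return (cp >= 0x3000 && cp <= 0x30FF)    // punctuation, hiragana, katakana
                || (cp >= 0x4E00 && cp <= 0x9FAF); // CJK unified ideographs as listed
    }

    static boolean startsJapanese(String s) {
        return !s.isEmpty() && inJapaneseRange(s.codePointAt(0));
    }

    // Strings starting with a Japanese character first (false < true on the negated flag);
    // within each group, String's natural (UTF-16 code unit) order is used.
    static final Comparator<String> JAPANESE_FIRST =
            Comparator.comparing((String s) -> !startsJapanese(s))
                    .thenComparing(Comparator.naturalOrder());

    public static void main(String[] args) {
        List<String> items = new ArrayList<>(List.of("Alpha", "家事問屋", "Beta", "問屋"));
        items.sort(JAPANESE_FIRST);
        System.out.println(items);
    }
}
```

The natural order used as a tie-breaker is a plain code-unit comparison, not a linguistic one; a Collator could be passed to thenComparing instead if dictionary-style ordering is wanted within each group.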

Upvotes: 0
