lambypie

Reputation: 481

Splitting a multi-lingual string in ColdFusion / Java

I have text that arrives as a single multi-lingual string, as shown below:

This is a multi-lingual string.私は別の言語にそれを分割する必要がありますPlease help me. This is a multi-lingual string.私は別の言語にそれを分割する必要がありますPlease help me

I have to split it into its different languages (in this example, English and Japanese).

That is, I need the string split as follows:

1. This is a multi-lingual string.
2. 私は別の言語にそれを分割する必要があります
3. Please help me. This is a multi-lingual string.
4. 私は別の言語にそれを分割する必要があります
5. Please help me

Please help. Thanks in advance.

Upvotes: 1

Views: 306

Answers (1)

Bart Enkelaar

Reputation: 694

That's a really hard problem. You would need dictionaries to check the words of each sentence against, and even then there would be no sure-fire way to do it. Take, for example, the sentence:

"war war war"

could be "war (English) was (from German war) strange (from Dutch war)", but there would be no way to tell these languages apart.

To be honest, I'm not sure it can be done at all if your problem definition is "Split ANY string into its component languages".

Edit: If you don't mind these kinds of annoying border cases, you could check out the language-detection library hosted on Google Code: https://code.google.com/p/language-detection/

It claims over 99% precision for 53 languages. That might be enough for you.

You would also have to combine this with some smart word-grouping algorithm; splitting on alphabet type might be a good start for that. You can use Unicode character-range regular expressions to split on alphabet type. For example, /([\u0600-\u06FF]+\s*)+/ should match all groups of words written in the Arabic script in a sentence.
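As a rough illustration of the range-based approach, here is a minimal Java sketch that applies the Arabic-block pattern above with java.util.regex. The class name ArabicRuns and the method findArabicRuns are made up for this example, and the sample text is written with Unicode escapes so the source file stays plain ASCII.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class ArabicRuns {
    // One or more words made of characters from the Arabic block
    // (U+0600 to U+06FF), each optionally followed by whitespace.
    private static final Pattern ARABIC_RUN =
            Pattern.compile("([\\u0600-\\u06FF]+\\s*)+");

    /** Returns every group of Arabic-script words in the text, trimmed. */
    static List<String> findArabicRuns(String text) {
        List<String> runs = new ArrayList<>();
        Matcher m = ARABIC_RUN.matcher(text);
        while (m.find()) {
            runs.add(m.group().trim());
        }
        return runs;
    }

    public static void main(String[] args) {
        // Two Arabic words embedded in an English sentence.
        String text = "Hello \u0645\u0631\u062D\u0628\u0627 \u0628\u0643 world";
        for (String run : findArabicRuns(text)) {
            System.out.println(run);
        }
    }
}
```

The same idea should carry over to other scripts by swapping in a different code-point range.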

If you're looking for specific alphabets, the full list of Unicode code points can be found on Wikipedia here: https://en.wikipedia.org/wiki/List_of_Unicode_characters

Edit 2: Now that you've narrowed down your problem, you can solve it with a simple regular expression: /([a-zA-Z,.]+\s*)+/ will match all groups of words written in the Latin script. You can add more punctuation marks to that character class if they're used (the hyphen in "multi-lingual" is one such case), but remember to either put the dash first or escape it, since it has special meaning within character classes. You can then simply replace those groups with themselves wrapped in div tags to solve your problem.
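A minimal Java sketch of that idea might look like the following. It assumes the Latin-script pattern with a leading dash added to the character class, so that "multi-lingual" stays in one piece. The class name LatinSplitter and its methods are hypothetical, and the short Japanese fragment is written with Unicode escapes to keep the source ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class LatinSplitter {
    // Groups of Latin-script words; the dash comes first in the class
    // so it is treated as a literal character, not a range operator.
    private static final Pattern LATIN_RUN =
            Pattern.compile("([-a-zA-Z,.]+\\s*)+");

    /** Splits text into alternating Latin and non-Latin segments. */
    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = LATIN_RUN.matcher(text);
        int last = 0;
        while (m.find()) {
            // Whatever sits between two Latin runs is the other language.
            addIfNotBlank(parts, text.substring(last, m.start()));
            addIfNotBlank(parts, m.group());
            last = m.end();
        }
        addIfNotBlank(parts, text.substring(last));
        return parts;
    }

    /** Wraps every Latin-script group in div tags, leaving the rest as is. */
    static String wrapLatin(String text) {
        return LATIN_RUN.matcher(text).replaceAll("<div>$0</div>");
    }

    private static void addIfNotBlank(List<String> parts, String s) {
        String trimmed = s.trim();
        if (!trimmed.isEmpty()) {
            parts.add(trimmed);
        }
    }

    public static void main(String[] args) {
        String ja = "\u79C1\u306F"; // short Japanese fragment
        String text = "This is a multi-lingual string." + ja
                + "Please help me. This is a multi-lingual string." + ja
                + "Please help me";
        List<String> parts = split(text);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println((i + 1) + ". " + parts.get(i));
        }
    }
}
```

Since ColdFusion runs on the JVM, the same java.util.regex classes should be usable from CFML as well, though that part is not shown here.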

Upvotes: 9
