Albertine

Reputation: 25

Split text into words & numbers with unicode support (preg_split)

I'm trying to split (with preg_split) a text containing many foreign characters and digits into words and numbers of length >= 2, without punctuation. My current code only splits the text into words: it ignores digits and does not enforce the length >= 2 rule. How can I do this?

$text = 'abc 文 字化け, efg Yukarda mavi gök, asağıda yağız yer yaratıldıkta; (1998 m. siejės 7 d.). Ton pate dėina bandomkojė бойынша бірінші орында тұр (79.65 %), айына 41';
$splitted = preg_split('#\P{L}+#u', $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

Expected result: array('abc', '字化け', 'efg', 'Yukarda', 'mavi', 'gök', 'asağıda', 'yağız', 'yer', 'yaratıldıkta', '1998', 'siejės', 'Ton', 'pate', 'dėina', 'bandomkojė', 'бойынша', 'бірінші', 'орында', 'тұр', '79.65', 'айына', '41');

NB: I already tried following these docs (link1 & link2) but couldn't get it to work :-/

Upvotes: 1

Views: 562

Answers (3)

dev-null-dweller

Reputation: 29482

With a little hack: match digits separated by a dot first, then fall back to matching runs of two or more word characters (digits included):

preg_match_all("#(?:\d+\.\d+|\w{2,})#u", $text, $matches);
$splitted = $matches[0];

http://codepad.viper-7.com/X7Ln1V
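
To see what this pattern returns, here is a minimal sketch that wraps it in a function. The name extract_tokens and the shortened sample string are hypothetical, chosen only for illustration. It assumes a PHP build where the u modifier makes \w Unicode-aware.

```php
<?php
// Hypothetical helper name; the pattern is the one from the answer above.
function extract_tokens(string $text): array
{
    // Try "digits.digits" first so decimals such as 79.65 stay whole,
    // then fall back to any run of two or more word characters.
    preg_match_all('#\d+\.\d+|\w{2,}#u', $text, $matches);
    return $matches[0];
}

// Shortened sample, not the full text from the question.
$sample = 'abc 文 字化け, gök (1998 m. 7 d.) тұр (79.65 %), 41';
print_r(extract_tokens($sample));
```

Note that \w also matches digits and the underscore, so a mixed token such as "ab12" would come back as a single element.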

Upvotes: 1

Casimir et Hippolyte

Reputation: 89584

Use preg_match_all instead; it lets you check the length condition directly in the pattern (this is hard to do with preg_split, but not impossible):

$text = 'abc 文 字化け, efg Yukarda mavi gök, asağıda yağız yer yaratıldıkta; (1998 m. siejės 7 d.). Ton pate dėina bandomkojė бойынша бірінші орында тұр (79.65 %), айына 41';
preg_match_all('~\p{L}{2,}+|\d{2,}+(?>\.\d++)?|\d\.\d++~u',$text,$matches);
print_r($matches);

Explanation:

   \p{L}{2,}+         # letter 2 or more times
|                     # OR
   \d{2,}+            # digit 2 or more times
   (?>\.\d++)?        # can be a decimal number
|                     # OR
   \d\.\d++           # a single digit MUST be followed by a decimal part
                      # (length constraint)
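
Since print_r($matches) prints a nested array, the flat list asked for in the question is $matches[0]. A minimal sketch of that, using the same pattern; the function name words_and_numbers and the sample string are hypothetical:

```php
<?php
// Hypothetical helper name; the pattern is copied from the answer above.
function words_and_numbers(string $text): array
{
    $pattern = '~\p{L}{2,}+|\d{2,}+(?>\.\d++)?|\d\.\d++~u';
    preg_match_all($pattern, $text, $matches);
    // $matches[0] holds the full-pattern matches as a flat list.
    return $matches[0];
}

// Made-up sample covering each branch: words, a long integer,
// a lone digit, and decimals with a one-digit and a two-digit integer part.
$sample = 'gök 文 字化け (1998 m. 7 d.) 79.65 %, 3.5 ab12';
print_r(words_and_numbers($sample));
```

Because letters and digits are matched by separate alternatives, a mixed token such as "ab12" is returned as two elements, "ab" and "12".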

Upvotes: 2

DaleJ

Reputation: 3

Splitting CJK into "words" is kind of meaningless. Each character is a word. If you split on whitespace, then you get phrases.

So it depends on what you're actually trying to accomplish. If you're indexing text, then you need to consider bigrams and/or CJK idioms.
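
As a rough illustration of the bigram idea, the sketch below produces overlapping two-character pairs from a run of text. This is only one reading of "bigrams", and the function name char_bigrams is hypothetical. A real indexer would likely apply it to CJK runs only.

```php
<?php
// Hypothetical helper: overlapping character bigrams of a string.
function char_bigrams(string $run): array
{
    // Splitting on the empty pattern with the "u" modifier yields
    // one element per Unicode character instead of one per byte.
    $chars = preg_split('//u', $run, -1, PREG_SPLIT_NO_EMPTY);
    $bigrams = [];
    for ($i = 0; $i + 1 < count($chars); $i++) {
        $bigrams[] = $chars[$i] . $chars[$i + 1];
    }
    return $bigrams;
}

print_r(char_bigrams('字化け')); // two pairs: 字化 and 化け
```

A run shorter than two characters yields an empty array in this sketch.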

Upvotes: 0
