Reputation: 2307
I am writing code that breaks text into words and then does things like counting word lengths.
I came up with this (after some searching):
$text = preg_replace("/[^[:alnum:][:space:]]/u", ' ', $text);
$words = mb_split( ' +', $text );
However, contractions don't work: an apostrophe is the same character as a single quote, so it gets replaced with a space along with the rest of the punctuation.
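To make the failure concrete, here is a minimal sketch that runs the two lines above on a single sentence. The sample sentence is only an illustration, and the snippet assumes the mbstring extension is loaded for `mb_split`.

```php
<?php
// Hypothetical sample sentence; any text containing a contraction will do.
$text = "That's what I said.";

// Same two steps as in the question: replace everything that is not
// alphanumeric or whitespace with a space, then split on runs of spaces.
$text  = preg_replace("/[^[:alnum:][:space:]]/u", ' ', $text);
$words = mb_split(' +', $text);

// The apostrophe has been turned into a space, so the contraction
// is split into two tokens: "That" and "s".
print_r($words);
```

The contraction comes out as the two tokens `That` and `s` rather than as one word, which is the behaviour I want to avoid.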
I need a way to separate out words but include contractions. For now, I've included all the contractions I could think of as stopwords but that's most unsatisfactory. I'm not great with regex and need some advice.
Although I posted my own inelegant solution, I am leaving this question open in the hope of encouraging a more perfect answer.
Upvotes: 0
Views: 185
Reputation: 2307
I've been labouring at this for a while. The comments and Taha Paksu's remarkably effective solution helped me think through the problem. Taha Paksu's solution cleanly isolated words, except when it came to accented letters. A Google search seems to suggest that regex is not so friendly with non-ASCII characters.
It was when I gave up trying to do regex voodoo (anyone who can has my deepest respect) that I came up with this not so elegant hack.
$text = "Testing text. Café is spelled true. And pokémon too... ‘bad quotes’. (brackets)... Löwen, Bären, Vögel und Käfer sind Tiere. That’s what I said.";
$text = str_replace(array('’',"'"), '000AP000', $text);
$text = str_replace("-", '000HY000', $text);
$text = preg_replace("/[^[:alnum:][:space:]]/u", ' ', $text);
$text = str_replace('000AP000', "'", $text);
$text = str_replace('000HY000', "-", $text);
$text = str_replace(array("' ",'- ','  '," '",' -','  '), ' ', $text);
$words = mb_split( ' +', $text );
It uses two statistically unlikely strings as placeholders for apostrophes and hyphens, strips the rest of the punctuation, drops the apostrophes and hyphens back in, and then removes any of them that touch a space (along with multiple spaces). It works for everything I can find to throw at it.
I'd like to find a less fiddly solution if I can but my regex skills might not be up to the task (even with a cheat-sheet open).
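One possible direction for a less fiddly version, offered as a sketch rather than a tested replacement: match words directly with Unicode property classes instead of cleaning the text first. In PHP, the `u` modifier makes PCRE treat the pattern and subject as UTF-8, and `\p{L}` and `\p{N}` match letters and numbers in any script. The pattern and the shortened sample text below are assumptions for illustration, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Shortened version of the sample text used above, plus one hyphenated word.
$text = "Café is spelled true. ‘bad quotes’. (brackets)... "
      . "Löwen, Bären und Käfer. That’s what I said. armour-like";

// A word is a run of letters/digits, optionally followed by groups of
// (apostrophe, curly apostrophe, or hyphen) + more letters/digits.
// Quotes and hyphens that are not between letters are never matched.
$pattern = "/[\p{L}\p{N}]+(?:['’-][\p{L}\p{N}]+)*/u";

preg_match_all($pattern, $text, $matches);
$words = $matches[0];

print_r($words);
```

With this pattern an apostrophe or hyphen only counts as part of a word when it sits between two letters or digits, so no placeholders or clean-up pass are needed. It has not been checked against every edge case (for example, words that legitimately end in an apostrophe).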
Upvotes: 0
Reputation: 15616
Found a better way: using word boundaries and a set of characters allowed inside words, you can count the words directly:
<?php
$text = "One morning, when Gregor Samsa woke from troubled dreams,
he found himself transformed in his bed into a horrible vermin.
'He lay on his armour-like back', and if he lifted his head a
little he could see his brown belly, slightly domed and divided by arches
into stiff sections. The bedding was hardly able to cover it and
seemed ready to slide off any moment. His many legs, pitifully thin
compared with the size of the rest of him, waved about helplessly as he
looked. \"What's happened to me?\" he thought. It wasn't a dream. His
room, a proper human room although a little too small, lay peacefully
between its four familiar walls. A collection of textile samples lay
spread out on the table - Samsa was a travelling salesman - and
above it there hung a picture that he had recently cut out of an
illustrated magazine and housed in a nice, gilded frame. It showed
a lady fitted out with a fur hat and fur boa who sat upright,
raising a heavy fur muff that covered the whole of her lower arm
towards the viewer. Gregor then turned to look out the window at the
dull weather";
preg_match_all("/\b[\w'-]+\b/", $text, $words);
print_r(count($words[0]));
Note: I allowed both - and ' to appear inside a word, so "armour-like" counts as one word.
Regex Test: regexr.com/4ego6
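Since the question also mentions counting word sizes, here is a small sketch of how the matches from the same regex could be turned into a length histogram. The sample sentence is made up for illustration and is ASCII-only, so plain `strlen` is enough; for accented text you would probably want `mb_strlen` and a Unicode-aware pattern instead.

```php
<?php
// Short ASCII sample built from phrases in the text above.
$text = "It wasn't a dream. He lay on his armour-like back.";

// Same pattern as in the answer: word characters, apostrophes and hyphens
// between two word boundaries.
preg_match_all("/\b[\w'-]+\b/", $text, $matches);
$words = $matches[0];

// Map each word to its length, then count how often each length occurs.
// Note that apostrophes and hyphens are included in the length.
$lengths   = array_map('strlen', $words);
$histogram = array_count_values($lengths);
ksort($histogram);

echo count($words), " words\n";
print_r($histogram);
```

For this sentence the pattern finds 10 words, with "wasn't" and "armour-like" each kept as a single word.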
Upvotes: 1