Reputation: 586
I need to extract the words and phrases within a text. For example, the text is:
Hello World, "Japan and China", Americans, Asians, "Jews and Christians", and semi-catholics, Jehovah's witnesses
Using preg_split(), it should return the following:
I need to know the regex that makes this work (or whether it is possible at all). Note the rules: phrases are enclosed in double quotes ("). Alphanumerics, single quotes (') and dashes (-) are considered part of a word (that's why "Jehovah's" and "semi-catholics" each count as one word). Everything else separated by spaces is treated as single words, and any other symbols not mentioned are ignored.
Upvotes: 1
Views: 1406
Reputation: 41905
You can actually do it very simply with str_getcsv like this:
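// sample input from the question
$str = 'Hello World, "Japan and China", Americans, Asians, "Jews and Christians", and semi-catholics, Jehovah\'s witnesses';
// replace each run of commas followed by spaces, or each run of spaces, with a single space
$str = preg_replace('/(,+[ ]+)|([ ]+)/', ' ', $str);
// treat the input as CSV, with spaces as delimiters and double quotes as enclosures
print_r(str_getcsv($str, ' ', '"'));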
output:
Array
(
    [0] => Hello
    [1] => World
    [2] => Japan and China
    [3] => Americans
    [4] => Asians
    [5] => Jews and Christians
    [6] => and
    [7] => semi-catholics
    [8] => Jehovah's
    [9] => witnesses
)
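The two steps above could be wrapped in a small helper, roughly as sketched below. The function name split_words_and_phrases is made up for illustration, and trimming the input, passing the escape character explicitly, and dropping empty fields are my additions rather than part of the snippet above. One caveat: the preg_replace also runs inside the quoted phrases, so a comma followed by a space within a phrase gets collapsed to a single space too.

```php
<?php
// Hypothetical helper (name chosen for illustration only).
function split_words_and_phrases(string $text): array
{
    // Collapse "commas followed by spaces" and runs of spaces into one space.
    $normalized = preg_replace('/(,+[ ]+)|([ ]+)/', ' ', trim($text));

    // Parse as CSV: space is the delimiter, double quote is the enclosure.
    // The escape character is passed explicitly (an assumption of this sketch).
    $fields = str_getcsv($normalized, ' ', '"', '\\');

    // Drop null or empty fields, e.g. when the input is an empty string.
    $fields = array_filter($fields, function ($field) {
        return $field !== null && $field !== '';
    });

    // Reindex so the keys run 0..n-1 again.
    return array_values($fields);
}

print_r(split_words_and_phrases('Hello World, "Japan and China", Americans'));
```

This only behaves as expected when every comma is followed by at least one space, as in the example string.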
Upvotes: 1
Reputation: 133
If your example string is typical, you can split on the commas first and then decide whether each piece is a quoted phrase or a group of words. I have used the heredoc syntax here so the string can contain both single and double quotes without escaping.
$string = <<<TEST
Hello World, "Japan and China", Americans, Asians, "Jews and Christians", and semi-catholics, Jehovah's witnesses
TEST;
$pieces = explode(",", $string); // break into pieces on commas
$words_and_phrases = array(); // collects the results
foreach ($pieces as $piece): // work through the pieces
    $piece = trim($piece); // strip surrounding whitespace
    // compare with false explicitly: strpos() returns 0 when the quote is the first character
    if (strpos($piece, '"') !== false): // contains a double quote, so this is a phrase
        $words_and_phrases[] = str_replace('"', '', $piece);
    else: // otherwise, these are individual words
        $words = explode(" ", $piece);
        $words_and_phrases = array_merge($words_and_phrases, $words);
    endif;
endforeach;
print_r($words_and_phrases);
Note: You could also use preg_replace, but it seems like overkill for something like this.
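For comparison, here is a minimal regex sketch of the rules stated in the question. It uses preg_match_all (not preg_replace, and not the preg_split the question asked about), and the function name extract_words_and_phrases is hypothetical. It assumes the quotes are balanced and that "alphanumerics" means ASCII letters and digits.

```php
<?php
// Hypothetical helper (name chosen for illustration only).
function extract_words_and_phrases(string $text): array
{
    // Either a double-quoted phrase, or a run of letters, digits,
    // single quotes and dashes. Everything else is skipped.
    $pattern = '/"[^"]*"|[A-Za-z0-9\'-]+/';
    preg_match_all($pattern, $text, $matches);

    // Strip the surrounding double quotes from phrases.
    // Words never contain a double quote, so they pass through unchanged.
    return array_map(function ($match) {
        return trim($match, '"');
    }, $matches[0]);
}

$text = 'Hello World, "Japan and China", Americans, Jehovah\'s witnesses';
print_r(extract_words_and_phrases($text));
```

Unlike splitting on commas, this does not depend on how the words are separated, so stray punctuation between words is simply ignored.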
Upvotes: 0