Ronald Borla

Reputation: 586

How to extract words and phrases using preg_split() in PHP?

I need to extract the words and phrases within a text. For example, the text is:

Hello World, "Japan and China", Americans, Asians, "Jews and Christians", and semi-catholics, Jehovah's witnesses

Using preg_split(), it should return the following:

  1. Hello
  2. World
  3. Japan and China
  4. Americans
  5. Asians
  6. Jews and Christians
  7. and
  8. semi-catholics
  9. Jehovah's
  10. witnesses

I need to know the regex that makes this work (or whether it is possible at all). Note the rules: phrases are enclosed in double quotes ("). Alphanumerics, single quotes (') and dashes (-) are part of a word, which is why "Jehovah's" and "semi-catholics" each count as one word. Everything else separated by spaces is treated as individual words, and any other symbols are ignored.

Upvotes: 1

Views: 1406

Answers (2)

Benjamin Crouzier

Reputation: 41905

You can actually do it very simply with str_getcsv like this:

// replace any run of commas and/or spaces with a single space
$str = preg_replace('/(,+[ ]+)|([ ]+)/', ' ', $str);
// treat the input as CSV, with spaces as delimiters and double quotes as enclosures
print_r(str_getcsv($str, ' ', '"'));

output:

Array
(
    [0] => Hello
    [1] => World
    [2] => Japan and China
    [3] => Americans
    [4] => Asians
    [5] => Jews and Christians
    [6] => and
    [7] => semi-catholics
    [8] => Jehovah's
    [9] => witnesses
)
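The snippet above only normalizes commas and spaces, while the question also asks for other symbols to be ignored. A minimal sketch of one way to extend it, assuming the word characters are exactly those the question lists (letters, digits, apostrophes and dashes); the helper name split_words_csv is hypothetical and not part of the original answer:

```php
<?php
// Hypothetical helper: same str_getcsv idea, with broader clean-up first.
function split_words_csv(string $str): array
{
    // Collapse every run of characters that are not letters, digits,
    // apostrophes, dashes or double quotes (spaces included) into one space.
    $str = preg_replace('/[^A-Za-z0-9\'"-]+/', ' ', $str);
    // Spaces delimit fields and double quotes enclose phrases; the escape
    // character is passed explicitly rather than relying on the default.
    return str_getcsv(trim($str), ' ', '"', '\\');
}

print_r(split_words_csv('Hello World; "Japan and China", Asians!'));
```

One side effect to be aware of: punctuation inside a quoted phrase is replaced by a space as well, which may or may not be what you want.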

Upvotes: 1

Odyssey

Reputation: 133

If your example string is typical, begin by dealing with the single and double quotes. I have used the heredoc syntax here to make the string safe to work with.

$string = <<<TEST
Hello World, "Japan and China", Americans, Asians, "Jews and Christians", and semi-catholics, Jehovah's witnesses
TEST;
$safe_string = addslashes($string);//make the string safe to work with
$pieces = explode(",",$safe_string);//break into pieces on comma
$words_and_phrases = array();//initiate new array

foreach($pieces as $piece)://begin working with the pieces
    $piece = trim($piece);//a little clean up
    if(strpos($piece,'"') !== false)://this is a phrase (compare with false: strpos returns 0 for a match at the start)
        $words_and_phrases[] = str_replace('"','',stripslashes($piece));
    else://else, these are words
        $words = explode(" ",stripslashes($piece));
        $words_and_phrases = array_merge($words_and_phrases, $words);
    endif;
endforeach;
print_r($words_and_phrases);

Note: You could also use preg_replace, but it seems like overkill for something like this.
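For comparison, a regex-only route is also fairly short. The sketch below uses preg_split(), as the question originally asked, and treats either a whole quoted phrase or a run of non-word characters as the delimiter. The function name split_words_regex is hypothetical, and the pattern assumes quotes are balanced and that word characters are limited to letters, digits, apostrophes and dashes:

```php
<?php
// Hypothetical helper illustrating a preg_split() approach.
function split_words_regex(string $str): array
{
    // A delimiter is either a whole "quoted phrase" (its contents captured)
    // or a run of characters that cannot be part of a word. The double quote
    // is excluded from the second branch so it is left for the first one.
    $pattern = '/"([^"]*)"|[^A-Za-z0-9\'"-]+/';
    // DELIM_CAPTURE keeps the captured phrase text in the result;
    // NO_EMPTY drops the empty strings left between adjacent delimiters.
    return preg_split($pattern, $str, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

$text = 'Hello World, "Japan and China", Americans, Asians, '
      . '"Jews and Christians", and semi-catholics, Jehovah\'s witnesses';
print_r(split_words_regex($text));
```

Unlike the explode() version, this does not depend on commas separating the phrases, but an unbalanced quote would be left attached to the neighbouring word.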

Upvotes: 0
