user501144

Reputation: 1

Get words and quoted phrases from text as an array

I would like to use a regex in PHP to extract words and phrases from a string. Phrases are delimited by quotes, either double or single. The regular expression also has to account for single quotes inside words (e.g. nation's).

Example string:

The nation's economy 'is really' poor, but "might be getting" better.

I would like PHP to split this kind of string into an array with a regex, as follows:

Array
(
    [0] => "The"
    [1] => "nation's"
    [2] => "economy"
    [3] => "is really"
    [4] => "poor"
    [5] => "but"
    [6] => "might be getting"
    [7] => "better"
)

What PHP code would accomplish this?

Upvotes: -1

Views: 288

Answers (3)

kennytm

Reputation: 523714

Use preg_match_all with the regex:

(?<![\w'"])(?:['"][^'"]+['"]|[\w']+)(?![\w'"])

Example: https://3v4l.org/vBGY7

preg_match_all(
  '/(?<![\w\'"])(?:[\'"][^\'"]+[\'"]|[\w\']+)(?![\w\'"])/', 
  "The nation's economy 'is really' poor, but \"might be getting\" better.",
  $matches
);
 
print_r($matches[0]);

(Note that this doesn't recognize hy-phe-nat-ed words, since the question does not mention them.)

Output (containing quote wrappings):

Array
(
    [0] => The
    [1] => nation's
    [2] => economy
    [3] => 'is really'
    [4] => poor
    [5] => but
    [6] => "might be getting"
    [7] => better
)
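If the wrapping quotes are not wanted, they can be removed after matching. The following is a minimal sketch that reuses the pattern above; the function name splitWordsAndPhrases is hypothetical and only for illustration. It strips one quote character from each end, but only when a token both starts and ends with a quote, so apostrophes inside words such as nation's are left alone.

```php
<?php
// Hypothetical helper (not part of the original answer): same pattern as above,
// followed by a pass that removes the wrapping quotes from quoted phrases.
function splitWordsAndPhrases(string $text): array
{
    $pattern = '/(?<![\w\'"])(?:[\'"][^\'"]+[\'"]|[\w\']+)(?![\w\'"])/';
    preg_match_all($pattern, $text, $matches);

    return array_map(
        static function (string $token): string {
            $first = $token[0];
            $last  = substr($token, -1);
            $isQuote = static function (string $c): bool {
                return $c === '"' || $c === "'";
            };
            // Strip only when the token is wrapped in quotes on both ends.
            if (strlen($token) >= 2 && $isQuote($first) && $isQuote($last)) {
                return substr($token, 1, -1);
            }
            return $token;
        },
        $matches[0]
    );
}

print_r(splitWordsAndPhrases(
    "The nation's economy 'is really' poor, but \"might be getting\" better."
));
```

Like the pattern itself, this does not check that the opening and closing quotes are the same character.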

Upvotes: 2

mickmackusa

Reputation: 48041

To split the text as required, match the start of the string or a literal space because all matches will follow one of those.

For the quoted text, capture the leading quote, match zero or more characters which are not that specific quote, then match the corresponding trailing quote.

For the non-quoted text, match all characters which are not spaces or unwanted punctuation.

Code:

$str = <<<TEXT
The nation's economy 'is really' poor, but "might be getting" better.
TEXT;

$pattern = '#(?:^| )(?|([\'"])((?:(?!\1).)*)\1|()([^,. ]+))#';

preg_match_all($pattern, $str, $m);
var_export($m[2]);

Pattern Breakdown:

(?:^| )              #match the start of the string or a space
(?|                  #branch reset: both branches reuse groups 1 and 2, so no groups 3 or 4 appear
   ([\'"])           #capture the leading quote as group 1
   ((?:(?!\1).)*)    #capture zero or more characters that are not the group 1 quote, as group 2
   \1                #match the trailing quote
  |                  #or
   ()                #capture nothing as group 1
   ([^,. ]+)         #capture the unquoted "word" as group 2
)

Output:

array (
  0 => 'The',
  1 => 'nation\'s',
  2 => 'economy',
  3 => 'is really',
  4 => 'poor',
  5 => 'but',
  6 => 'might be getting',
  7 => 'better',
)
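The breakdown above can also be kept inside the pattern itself with the x (extended) modifier, which ignores unescaped whitespace and treats # as the start of a comment. This is only a sketch of that idea, with three assumed adjustments: the delimiter becomes ~ because # now starts comments, the literal space in the first group is written as [ ] because bare spaces are ignored in x mode, and a nowdoc is used so the quotes need no PHP-level escaping.

```php
<?php
$str = 'The nation\'s economy \'is really\' poor, but "might be getting" better.';

// Nowdoc: the regex is taken literally, no PHP string escaping required.
$pattern = <<<'REGEX'
~
(?:^|[ ])            # start of the string or a space
(?|                  # branch reset: both alternatives number their groups 1 and 2
   (['"])            # group 1: the leading quote
   ((?:(?!\1).)*)    # group 2: zero or more characters that are not that quote
   \1                # the matching trailing quote
  |                  # or
   ()                # group 1: empty placeholder
   ([^,. ]+)         # group 2: an unquoted word
)
~x
REGEX;

preg_match_all($pattern, $str, $m);
var_export($m[2]);
```

Whitespace inside a character class such as [^,. ] stays literal in x mode, so that part of the pattern is unchanged.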

Upvotes: -1

casablanca

Reputation: 70731

$str = <<< END
The nation's economy 'is really' poor, but "might be getting" better.
END;
$str = ' ' . $str . ' '; // add surrounding spaces to make things easier

$regex = '/(?<=\s)(".*?"|\'.*?\'|.*?)(?=\s)/';

preg_match_all($regex, $str, $matches);

// strip commas, periods and surrounding quotes from resulting words
$words = $matches[0];
foreach ($words as &$word)
  $word = trim($word, ' ,.\'"');
unset($word); // break the reference left over from the foreach

print_r($words);

Upvotes: 0
