Reputation: 1
I would like to use a regex in PHP to split a string into words and phrases. Phrases are delimited by quotes, either double or single. The regular expression also has to take into consideration single quotes within words (e.g. nation's).
Example string:
The nation's economy 'is really' poor, but "might be getting" better.
I would like PHP to separate this type of string into an array using a regex, as follows:
Array
(
[0] => "The"
[1] => "nation's"
[2] => "economy"
[3] => "is really"
[4] => "poor"
[5] => "but"
[6] => "might be getting"
[7] => "better"
)
What would the PHP code be to accomplish this?
Upvotes: -1
Views: 288
Reputation: 523714
Use preg_match_all with the regex:
(?<![\w'"])(?:['"][^'"]+['"]|[\w']+)(?![\w'"])
Example: https://3v4l.org/vBGY7
preg_match_all(
'/(?<![\w\'"])(?:[\'"][^\'"]+[\'"]|[\w\']+)(?![\w\'"])/',
"The nation's economy 'is really' poor, but \"might be getting\" better.",
$matches
);
print_r($matches[0]);
(Note that this doesn't recognize hy-phe-nat-ed words, as hyphenation is not specified in the question.)
Output (containing quote wrappings):
Array
(
[0] => The
[1] => nation's
[2] => economy
[3] => 'is really'
[4] => poor
[5] => but
[6] => "might be getting"
[7] => better
)
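Since the question asks for the phrases without their quote wrappings, here is a minimal sketch of one way to post-process the matches. It reuses the pattern above unchanged; the `$text` and `$tokens` names and the unwrapping regex are my own additions, and the arrow function assumes PHP 7.4 or later.

```php
<?php
$text = "The nation's economy 'is really' poor, but \"might be getting\" better.";

// Same pattern as above: quoted phrases or words (apostrophes allowed inside words)
preg_match_all(
    '/(?<![\w\'"])(?:[\'"][^\'"]+[\'"]|[\w\']+)(?![\w\'"])/',
    $text,
    $matches
);

// Unwrap a token only when it both starts and ends with a quote,
// so the apostrophe inside "nation's" is left alone
$tokens = array_map(
    fn (string $token): string => preg_replace('/^[\'"](.*)[\'"]$/s', '$1', $token),
    $matches[0]
);

print_r($tokens);
```

With the example string this should give the eight unquoted entries requested in the question. Like the pattern itself, the unwrapping step does not check that the opening and closing quotes are the same character.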
Upvotes: 2
Reputation: 48041
To split the text as required, match the start of the string or a literal space because all matches will follow one of those.
For the quoted text, capture the leading quote, match zero or more characters which are not that specific quote, then match the corresponding trailing quote.
For the non-quoted text, match all characters which are not spaces or unwanted punctuation.
Code:
$str = <<<TEXT
The nation's economy 'is really' poor, but "might be getting" better.
TEXT;
$pattern = '#(?:^| )(?|([\'"])((?:(?!\1).)*)\1|()([^,. ]+))#';
preg_match_all($pattern, $str, $m);
var_export($m[2]);
Pattern Breakdown:
(?:^| ) #match the start of the string or a space
(?| #branch reset: both alternatives number their groups 1 and 2, so no group 3 or 4 appears in the result
([\'"]) #capture the leading quote as group 1
((?:(?!\1).)*) #capture zero or more characters that are not the leading quote as group 2
\1 #match the trailing quote
| #or
() #capture nothing as group 1
([^,. ]+) #capture the unquoted "word" as group 2
)
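To see what the branch reset buys you, here is a small sketch comparing the pattern above with a variant that uses a plain non-capturing group instead. The shortened input and the `$with` / `$without` variable names are only for illustration.

```php
<?php
$str = "'a b' c";

// With branch reset (?|...): both alternatives store their text in group 2
preg_match_all('#(?:^| )(?|([\'"])((?:(?!\1).)*)\1|()([^,. ]+))#', $str, $with);

// With a plain group (?:...): the unquoted word lands in group 3 instead
preg_match_all('#(?:^| )(?:([\'"])((?:(?!\1).)*)\1|([^,. ]+))#', $str, $without);

var_export($with[2]);    // quoted phrase and word together in one list
var_export($without[2]); // only the quoted phrase, plus an empty placeholder
var_export($without[3]); // only the unquoted word, plus an empty placeholder
```

Without the branch reset you would have to merge groups 2 and 3 yourself, which is why the single `$m[2]` lookup in the answer is enough.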
Output:
array (
0 => 'The',
1 => 'nation\'s',
2 => 'economy',
3 => 'is really',
4 => 'poor',
5 => 'but',
6 => 'might be getting',
7 => 'better',
)
Upvotes: -1
Reputation: 70731
$str = <<< END
The nation's economy 'is really' poor, but "might be getting" better.
END;
$str = ' ' . $str . ' '; // add surrounding spaces to make things easier
$regex = '/(?<=\s)(".*?"|\'.*?\'|.*?)(?=\s)/';
preg_match_all($regex, $str, $matches);
// strip commas, periods and surrounding quotes from the resulting words
$words = $matches[0];
foreach ($words as &$word) {
    $word = trim($word, ' ,.\'"');
}
unset($word); // break the reference left over from the loop
print_r($words);
Upvotes: 0