Reputation:
I need to support exact phrases (enclosed in quotes) in an otherwise space-separated list of terms, so splitting the string on the space character is no longer sufficient.
Example:
input : 'foo bar "lorem ipsum" baz'
output: ['foo', 'bar', 'lorem ipsum', 'baz']
I wonder whether this could be achieved with a single RegEx, rather than performing complex parsing or split-and-rejoin operations.
Any help would be greatly appreciated!
Upvotes: 10
Views: 26682
Reputation: 561
Expanding on the accepted answer, here's a search engine parser that,
Treating phrases as regular expressions makes the UI simpler for my purposes.
const matchOrIncludes = (str, search, useMatch = true) => {
  if (useMatch) {
    let result = false
    try {
      // `search` is treated as a regular expression; an invalid pattern throws
      result = str.match(search)
    } catch (err) {
      return false
    }
    return result
  }
  return str.includes(search)
}
const itemMatches = (item, searchString, fields) => {
  // Split into bare words and (optionally negated) quoted phrases
  const keywords = searchString.toString().replace(/\s\s+/g, ' ').trim().toLocaleLowerCase().match(/(-?"[^"]+"|[^"\s]+)/g) || []
  for (let i = 0; i < keywords.length; i++) {
    const negateWord = keywords[i].startsWith('-')
    let word = keywords[i].replace(/^-/, '')
    const isPhraseRegex = word.startsWith('"')
    if (isPhraseRegex) {
      word = word.replace(/^"(.+)"$/, "$1")
    }
    let word_in_item = false
    for (const field of fields) {
      if (item[field] && matchOrIncludes(item[field].toLocaleLowerCase(), word, isPhraseRegex)) {
        word_in_item = true
        break
      }
    }
    if ((!negateWord && !word_in_item) || (negateWord && word_in_item)) {
      return false
    }
  }
  return true
}
const item = {title: 'My title', body: 'Some text'}
console.log(itemMatches(item, 'text', ['title', 'body']))
Upvotes: 0
Reputation: 7874
ES6 solution supporting:
Code:
input.match(/\\?.|^$/g).reduce((p, c) => {
  if (c === '"') {
    p.quote ^= 1;
  } else if (!p.quote && c === ' ') {
    p.a.push('');
  } else {
    p.a[p.a.length - 1] += c.replace(/\\(.)/, "$1");
  }
  return p;
}, {a: ['']}).a
Output:
[ 'foo', 'bar', 'lorem ipsum', 'baz' ]
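To make the reducer's handling of backslash-escaped quotes easier to follow, here is the same character-by-character idea written as a plain loop. This is only a sketch: the function name `splitTerms` and the variable names are made up for illustration, and the `|| []` guard is an addition for inputs where the pattern matches nothing.

```javascript
// Hypothetical helper: same technique as the reducer, spelled out step by step.
function splitTerms(input) {
  // Each token is a single character, or a backslash plus the character after it.
  const tokens = input.match(/\\?.|^$/g) || [];
  const terms = [''];
  let inQuote = false;
  for (const token of tokens) {
    if (token === '"') {
      // An unescaped quote toggles "inside a phrase" and is dropped.
      inQuote = !inQuote;
    } else if (!inQuote && token === ' ') {
      // A space outside quotes starts a new term.
      terms.push('');
    } else {
      // Everything else is appended; an escape like \" becomes a literal ".
      terms[terms.length - 1] += token.replace(/\\(.)/, '$1');
    }
  }
  return terms;
}

console.log(splitTerms('foo bar "lorem ipsum" baz'));
// [ 'foo', 'bar', 'lorem ipsum', 'baz' ]
console.log(splitTerms('a "b \\"c\\" d" e'));
// [ 'a', 'b "c" d', 'e' ]
```

Consecutive spaces outside quotes produce empty terms with this approach, so a final `.filter(Boolean)` may be wanted depending on the input.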
Upvotes: 1
Reputation: 5655
This might be a very late answer, but I am interested in answering
([\w]+|\"[\w\s]+\")
http://regex101.com/r/dZ1vT6/72
Pure javascript example
'The rain in "SPAIN stays" mainly in the plain'.match(/[\w]+|\"[\w\s]+\"/g)
Outputs:
["The", "rain", "in", "\"SPAIN stays\"", "mainly", "in", "the", "plain"]
Upvotes: 0
Reputation: 2547
An easy-to-understand, general solution. It works for any delimiter and any 'join' character, and it supports 'joined' phrases of more than two words, i.e. inputs like
"hello my name is 'jon delaware smith fred' I have a 'long name'"
A bit like the answer by AC, but a bit neater:
function split(input, delimiter, joiner){
    var output = [];
    var joint = [];
    input.split(delimiter).forEach(function(element){
        if (joint.length > 0 && element.indexOf(joiner) === element.length - 1)
        {
            output.push(joint.join(delimiter) + delimiter + element);
            joint = [];
        }
        if (joint.length > 0 || element.indexOf(joiner) === 0)
        {
            joint.push(element);
        }
        if (joint.length === 0 && element.indexOf(joiner) !== element.length - 1)
        {
            output.push(element);
            joint = [];
        }
    });
    return output;
}
Upvotes: 0
Reputation:
Thanks a lot for the quick responses!
Here's a summary of the options, for posterity:
var input = 'foo bar "lorem ipsum" baz';
output = input.match(/("[^"]+"|[^"\s]+)/g);
output = input.match(/"[^"]*"|\w+/g);
output = input.match(/("[^"]*")|([^\s"]+)/g);
output = /(".+?"|\w+)/g.exec(input); // N.B.: exec() returns only one match per call
output = /"(.+?)"|(\w+)/g.exec(input); // N.B.: exec() returns only one match per call
For the record, here's the abomination I had come up with:
var input = 'foo bar "lorem ipsum" "dolor sit amet" baz';
var terms = input.split(" ");
var items = [];
var buffer = [];
for(var i = 0; i < terms.length; i++) {
    if(terms[i].indexOf('"') != -1) { // outer phrase fragment -- N.B.: assumes quote is either first or last character
        if(buffer.length === 0) { // beginning of phrase
            //console.log("start:", terms[i]);
            buffer.push(terms[i].substr(1));
        } else { // end of phrase
            //console.log("end:", terms[i]);
            buffer.push(terms[i].substr(0, terms[i].length - 1));
            items.push(buffer.join(" "));
            buffer = [];
        }
    } else if(buffer.length != 0) { // inner phrase fragment
        //console.log("cont'd:", terms[i]);
        buffer.push(terms[i]);
    } else { // individual term
        //console.log("standalone:", terms[i]);
        items.push(terms[i]);
    }
    //console.log(items, "\n", buffer);
}
items = items.concat(buffer);
//console.log(items);
Upvotes: 2
Reputation: 12617
Try this:
var input = 'foo bar "lorem ipsum" baz';
var R = /(\w|\s)*\w(?=")|\w+/g;
var output = input.match(R);
output is ["foo", "bar", "lorem ipsum", "baz"]
Note there are no extra double quotes around lorem ipsum
Although it assumes the input has the double quotes in the right place:
var input2 = 'foo bar lorem ipsum" baz'; var output2 = input2.match(R);
var input3 = 'foo bar "lorem ipsum baz'; var output3 = input3.match(R);
output2 is ["foo bar lorem ipsum", "baz"]
output3 is ["foo", "bar", "lorem", "ipsum", "baz"]
And won't handle escaped double quotes (is that a problem?):
var input4 = 'foo b\"ar bar\" \"bar "lorem ipsum" baz';
var output4 = input4.match(R);
output4 is ["foo b", "ar bar", "bar", "lorem ipsum", "baz"]
Upvotes: 4
Reputation: 542
var str = 'foo bar "lorem ipsum" baz';
var results = str.match(/("[^"]+"|[^"\s]+)/g);
... returns the array you're looking for.
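Note, however:
Bounding quotes are included in the matches; they can be removed by running replace(/^"([^"]+)"$/,"$1") on the results.
Spaces between the quotes stay intact. So, if there are extra spaces between lorem and ipsum, they'll be in the result. You can fix this by running replace(/\s+/," ") on the results.
If there's no closing " after ipsum (i.e. an incorrectly-quoted phrase) you'll end up with: ['foo', 'bar', 'lorem', 'ipsum', 'baz']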
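The match and the two clean-up passes mentioned above can be combined into one helper. A minimal sketch, assuming a hypothetical function name `tokenize`; the `g` flag on the whitespace pattern and the `|| []` fallback are additions so that every run of spaces is collapsed and an input with no terms yields an empty array:

```javascript
// Hypothetical helper combining the match with both clean-up passes.
function tokenize(input) {
  // Quoted phrases first, then runs of non-space, non-quote characters.
  const matches = input.match(/"[^"]+"|[^"\s]+/g) || [];
  return matches.map(term =>
    term
      .replace(/^"([^"]+)"$/, '$1') // strip the bounding quotes
      .replace(/\s+/g, ' ')         // collapse runs of whitespace inside a phrase
  );
}

console.log(tokenize('foo bar "lorem   ipsum" baz'));
// [ 'foo', 'bar', 'lorem ipsum', 'baz' ]
console.log(tokenize('foo bar "lorem ipsum baz'));
// [ 'foo', 'bar', 'lorem', 'ipsum', 'baz' ]
```

The second call shows the unclosed-quote case: the stray quote is skipped and the phrase falls back to individual words.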
Upvotes: 17
Reputation: 261
A simple regular expression will do, but it leaves the quotation marks in place, e.g.
'foo bar "lorem ipsum" baz'.match(/("[^"]*")|([^\s"]+)/g)
output: ['foo', 'bar', '"lorem ipsum"', 'baz']
edit: beaten to it by shyamsundar, sorry for the double answer
Upvotes: 2
Reputation: 31
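How about:
output = input.match(/".+?"|\w+/g)
then do a pass on output to lose the quotes.
Alternatively, since exec() with the g flag returns one match per call, loop over the matches:
var re = /"(.+?)"|(\w+)/g, m, output = [];
while ((m = re.exec(input)) !== null) output.push(m[1] !== undefined ? m[1] : m[2]);
which keeps only the non-empty capture of each match.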
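The capture-group idea can also be written with String.prototype.matchAll, which yields every match together with its groups, whereas a single exec() call returns only one match. A rough sketch, assuming a runtime that supports matchAll; the function name `collectTerms` is made up for illustration:

```javascript
// Hypothetical helper: collect the unquoted content of every match.
function collectTerms(input) {
  // Group 1 captures a quoted phrase without its quotes; group 2 a bare word.
  const pattern = /"(.+?)"|(\w+)/g;
  return Array.from(
    input.matchAll(pattern),
    m => (m[1] !== undefined ? m[1] : m[2]) // keep whichever group matched
  );
}

console.log(collectTerms('foo bar "lorem ipsum" baz'));
// [ 'foo', 'bar', 'lorem ipsum', 'baz' ]
```

Because the phrase content comes from the capture group, no separate pass to strip quotes is needed.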
Upvotes: 1
Reputation: 9368
'foo bar "lorem ipsum" baz'.match(/"[^"]*"|\w+/g);
The bounding quotes get included, though.
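One more thing to be aware of when choosing between the patterns in this thread: \w+ only matches letters, digits and underscores, so terms containing other characters get split. A small comparison sketch (the sample input is made up for illustration):

```javascript
const sample = 'foo-bar "lorem ipsum" baz.qux';

// \w+ breaks terms at punctuation such as "-" and "."
const wordOnly = sample.match(/"[^"]*"|\w+/g);
console.log(wordOnly);
// [ 'foo', 'bar', '"lorem ipsum"', 'baz', 'qux' ]

// [^\s"]+ keeps any run of non-space, non-quote characters together
const nonSpace = sample.match(/"[^"]*"|[^\s"]+/g);
console.log(nonSpace);
// [ 'foo-bar', '"lorem ipsum"', 'baz.qux' ]
```

Which behavior is preferable depends on whether punctuation should count as part of a search term.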
Upvotes: 1