Exact Finder

Reputation: 73

How to split a string by double-quoted words while ignoring escaped quotes

How can I split the string below

var test = 'sample "test""test2"   "test3\\"" sample2"last';

into the array ['sample','"test"','"test2"','"test3\\""','sample2"last'] using a JavaScript regex?

Some sample inputs and their expected outputs are listed below.

sample1 : ' test1 "test2" test3 "test four\\"" test" d'
output [' test1','"test2"','test3','"test four\\""','test" d']

sample2 : ' test1 test2'
output [' test1 test2']

sample3 : ' test1 "sub test2'
output [' test1 "sub test2']

sample4 : ' test1 "sub test2"'
output [' test1 ','"sub test2"']

sample5 : ' "test1" "sub test2" here'
output ['"test1"','"sub test2"', 'here']

Upvotes: 0

Views: 1204

Answers (6)

Satya

Reputation: 1

You can use https://www.npmjs.com/package/dqtokenizer

const dqtokenizer = require('dqtokenizer');

const testTokenize = (str, options) => {
    const tokens = dqtokenizer.tokenize(str, options);
    console.log();
    console.log(`str: ${str}`);
    console.log(`tokens:`);
    tokens.forEach((token, index) => console.log(`\t${index}: ${token}`));
}

const sample1 = ' test1 "test2" test3 "test four\\"" test" d';
// expected by the question: [' test1','"test2"','test3','"test four\\""','test" d']
testTokenize(sample1);

const sample2 = ' test1 test2';
// expected by the question: [' test1 test2']
testTokenize(sample2);

const sample3 = ' test1 "sub test2';
// expected by the question: [' test1 "sub test2']
testTokenize(sample3);

const sample4 = ' test1 "sub test2"';
// expected by the question: [' test1 ','"sub test2"']
testTokenize(sample4);

const sample5 = ' "test1" "sub test2" here';
// expected by the question: ['"test1"','"sub test2"', 'here']
testTokenize(sample5);

Output:

str:  test1 "test2" test3 "test four\"" test" d
tokens:
        0: test1
        1: "test2"
        2: test3
        3: "test four\""
        4: test
        5: " d

str:  test1 test2
tokens:
        0: test1
        1: test2

str:  test1 "sub test2
tokens:
        0: test1
        1: "sub test2

str:  test1 "sub test2"
tokens:
        0: test1
        1: "sub test2"

str:  "test1" "sub test2" here
tokens:
        0: "test1"
        1: "sub test2"
        2: here

Upvotes: 0

anubhava

Reputation: 785098

This regex should work for you for splitting:

/\s*"[^"\\]*(?:\\.[^"\\]*)*"\s*|.+?(?="[^"\\]*(?:\\.[^"\\]*)*"|$)/g

Code:

var input = [` test1 "test2" test3 "test four\\"" test" d`, ` test1 test2`, ` test1 "sub test2`, ` test1 "sub test2"`, ` "test1" "sub test2" here`];

const re = /\s*"[^"\\]*(?:\\.[^"\\]*)*"\s*|.+?(?="[^"\\]*(?:\\.[^"\\]*)*"|$)/g;

input.forEach(el => {
  console.log('<<', el, '>>');
  var arr = el.match(re);
  arr.forEach(i => console.log(i));
});

RegEx Details:

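  • \s*"[^"\\]*(?:\\.[^"\\]*)*"\s*: Match a quoted string, treating backslash-escaped characters (such as \") as part of it, along with any whitespace around it
  • |: OR
  • .+?(?="[^"\\]*(?:\\.[^"\\]*)*"|$): Match one or more characters, as few as possible, that must be followed by a complete quoted string or the end of the input.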
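
As a minimal sketch, the regex above can be wrapped in a helper that also trims each token. The function name splitQuoted, the constant name, and the trim/filter step are assumptions added for illustration; they are not part of the original answer.

```javascript
// Hypothetical helper: the names and the trim step are illustrative assumptions.
const QUOTED_OR_PLAIN = /\s*"[^"\\]*(?:\\.[^"\\]*)*"\s*|.+?(?="[^"\\]*(?:\\.[^"\\]*)*"|$)/g;

function splitQuoted(str) {
  // match() returns null when nothing matches (e.g. for an empty string)
  const parts = str.match(QUOTED_OR_PLAIN) || [];
  // Drop the whitespace the regex keeps around each piece
  return parts.map(part => part.trim()).filter(part => part !== '');
}

const test = 'sample "test""test2"   "test3\\"" sample2"last';
console.log(splitQuoted(test));
```

For the question's test string this yields the five tokens sample, "test", "test2", "test3\"" and sample2"last. Because of the trim, leading spaces such as the one in ' test1' are not preserved.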

Upvotes: 1

David Amar

Reputation: 257

A pure regexp solution: / +|(?<!\\")(?<=")(?=")/

This matches either space(s), or empty strings that are

  • preceded by " but not \"
  • followed by "

var test = 'sample "test""test2"   "test3\\"" sample2"last';
console.log(test.split(/ +|(?<!\\")(?<=")(?=")/));

Upvotes: 0

Carsten Massmann

Reputation: 28196

A little convoluted, but it does the job:

  • first, replace the escaped quotation marks (\") with a marker string x
  • then match every "-enclosed part of the string by calling RegExp.exec() repeatedly
  • take the first element of each result, remove the enclosing quotation marks, and replace the marker string with the original quotation mark
  • push the result into the results array

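var test = 'sample "test""test2"   "test3\\"" sample2"';
var x = '@#@', xr = RegExp(x, 'g'); // marker string and a regex to find it again
var rx = /"[^"]+"/g; // matches "-enclosed strings
var a, arr = [];
var masked = test.replace(/\\"/g, x); // hide escaped quotes behind the marker
while ((a = rx.exec(masked))) {
  // strip the enclosing quotes, then restore the escaped quotes
  arr.push(a[0].replace(/"/g, '').replace(xr, '"'));
}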

console.log(arr);

Upvotes: 0

Michał Turczyn

Reputation: 37367

If you can use negative lookbehind, you can use this pattern:

test.split(/(?<!\\)"/).map(i => i.trim()).filter(i => i != '')

Note that negative lookbehind is a relatively recent addition to JavaScript (it was standardized in ES2018), so older engines may not support it. It is supported by V8, which is used in Chrome, for example.

If you cannot use negative lookbehind, use this workaround: reverse the string, split it with a negative lookahead, then reverse the results back:

test
  .split('')
  .reverse()
  .join('')
  .split(/"(?!\\)/)
  .map(i => i.trim())
  .filter(i => i != '')
  .map(i => i.split('').reverse().join(''))
  .reverse()

Patterns used:

  • "(?!\\) - negative lookahead: match " which is not followed by \
  • (?<!\\)" - negative lookbehind: match " which is not preceded by \
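
To check that the two variants agree, here is a small sketch that wraps each one in a function and runs both on the question's test string. The names clean, reverseStr, splitLookbehind and splitReversed are hypothetical, chosen only for this illustration.

```javascript
// Hypothetical wrappers around the two snippets above; all names are illustrative.
const clean = parts => parts.map(i => i.trim()).filter(i => i !== '');
const reverseStr = s => s.split('').reverse().join('');

// Variant 1: split on a quote that is not preceded by a backslash
const splitLookbehind = str => clean(str.split(/(?<!\\)"/));

// Variant 2: reverse, split on a quote not followed by a backslash, reverse back
const splitReversed = str =>
  clean(reverseStr(str).split(/"(?!\\)/)).map(reverseStr).reverse();

const test = 'sample "test""test2"   "test3\\"" sample2"last';
console.log(splitLookbehind(test));
console.log(splitReversed(test));
```

Both calls print the same six tokens: sample, test, test2, test3\", sample2 and last. The unescaped quotes are consumed by the split, so the tokens do not keep their surrounding quotes, and sample2"last is split in two.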

Upvotes: 0

Ahmad

Reputation: 12717

You can split the string on non-word characters (\W matches anything other than letters, digits, and underscore), then remove any empty elements.

var test = 'sample "test""test2"   "test3\\"" sample2"';

var array = test.split(/\W/g).filter(e => e.length>0);

console.log(array);

Upvotes: 2
