hippietrail

Reputation: 17023

Split string in JavaScript using regex with zero width lookbehind

I know JavaScript regular expressions have native lookaheads but not lookbehinds.

I want to split a string so that a new piece begins before any member of one set of characters and a piece ends after any member of another set of characters.

Split before ໃ, ໄ, ໂ, ເ, ແ. Split after ະ.

In: ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ

Out: ເລື້ອຍໆມະ ຫັດສະ ຈັນ ເອກອັກຄະ ລັດຖະ ທູດ

I can achieve the "split before" part using zero-width lookahead:

'ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ'.split(/(?=[ໃໄໂເແ])/)

["ເລື້ອຍໆມະຫັດສະຈັນ", "ເອກອັກຄະລັດຖະທູດ"]

But I can't think of a general approach to simulating zero-width lookbehind.

I'm splitting strings of arbitrary Unicode text so don't want to substitute in special markers in a first pass, since I can't guarantee the absence of any string from my input.

Upvotes: 1

Views: 364

Answers (3)

hwnd

Reputation: 70750

Instead of splitting, you may consider using the match() method.

var s = 'ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ',
    r = s.match(/(?:(?!ະ).)+?(?:ະ|(?=[ໃໄໂເແ]|$))/g);

console.log(r); //=> [ 'ເລື້ອຍໆມະ', 'ຫັດສະ', 'ຈັນ', 'ເອກອັກຄະ', 'ລັດຖະ', 'ທູດ' ]
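For readers who want to reuse the idea with other character sets, here is a minimal sketch that builds the same pattern from two strings. The helper name matchPieces and its parameters are hypothetical, not part of the answer above, and the sketch assumes neither set contains characters that are special inside a regex character class (such as ], \, ^ or -).

```javascript
// Hypothetical helper: builds the pattern used above from two character sets.
// Assumption: neither set contains regex-special characters such as ] \ ^ -
function matchPieces(text, beforeSet, afterSet) {
  var pattern = new RegExp(
    "(?:(?![" + afterSet + "]).)+?" + // lazily take characters that are not "split after" characters
    "(?:[" + afterSet + "]" +         // then either consume one "split after" character...
    "|(?=[" + beforeSet + "]|$))",    // ...or stop just before a "split before" character or at the end
    "g"
  );
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(pattern) || [];
}

var pieces = matchPieces("ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ", "ໃໄໂເແ", "ະ");
console.log(pieces);
```

Two caveats apply to the pattern itself: because . does not match line terminators, input containing newlines may lose text, and a "split after" character that starts the string or directly follows another one is skipped rather than returned.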

Upvotes: 3

Mark Reed

Reputation: 95385

If you use parentheses in the delimiter regex, the captured text is included in the returned array. So you can just split on /(ະ)/ and then concatenate each of the odd members of the resulting array to the preceding even member. Example:

"ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູ".split(/(ະ)/).reduce(function (arr, str, index) {
  if (index % 2 === 0) {
    arr.push(str);               // even index: text between delimiters
  } else {
    arr[arr.length - 1] += str;  // odd index: captured ະ, glued onto the previous piece
  }
  return arr;
}, [])

Result: ["ເລື້ອຍໆມະ", "ຫັດສະ", "ຈັນເອກອັກຄະ", "ລັດຖະ", "ທູ"]

You can then do a second pass that splits each piece on the lookahead:

"ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູ".split(/(ະ)/).reduce(function (arr, str, index) {
  if (index % 2 === 0) {
    arr.push(str);               // even index: text between delimiters
  } else {
    arr[arr.length - 1] += str;  // odd index: captured ະ, glued onto the previous piece
  }
  return arr;
}, []).reduce(function (arr, str) { return arr.concat(str.split(/(?=[ໃໄໂເແ])/)); }, []);

Result: ["ເລື້ອຍໆມະ", "ຫັດສະ", "ຈັນ", "ເອກອັກຄະ", "ລັດຖະ", "ທູ"]

Upvotes: 1

Avinash Raj

Reputation: 174874

You could try matching rather than splitting:

> var re = /((?:(?!ະ).)+(?:ະ|$))/g;
undefined
> var str = "ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ"
undefined
> var m;
undefined
> while ((m = re.exec(str)) != null) {
... console.log(m[1]);
... }
ເລື້ອຍໆມະ
ຫັດສະ
ຈັນເອກອັກຄະ
ລັດຖະ
ທູດ

Then split each element of the resulting array again, using the lookahead from the question.
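That second pass might look like the sketch below. It collects the matches into an array instead of logging them; the variable names firstPass and result are illustrative, and the lookahead is the one from the question.

```javascript
// First pass: pieces that end in ະ (or at the end of the string), as in the loop above.
var re = /((?:(?!ະ).)+(?:ະ|$))/g;
var str = "ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ";
var firstPass = [];
var m;
while ((m = re.exec(str)) !== null) {
  firstPass.push(m[1]);
}

// Second pass: split every piece before ໃ, ໄ, ໂ, ເ or ແ and flatten into one array.
var result = firstPass.reduce(function (acc, piece) {
  return acc.concat(piece.split(/(?=[ໃໄໂເແ])/));
}, []);

console.log(result);
```

A piece that already starts with one of the lookahead characters is not given an empty leading element, because split() skips a zero-width match at the very start of the string.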

Upvotes: 1
