loretoparisi

Reputation: 16301

Turn a sed function into a JavaScript regex

I have this text normalization function in bash

normalize_text() {
  tr '[:upper:]' '[:lower:]' | sed -e 's/^/__label__/g' | \
    sed -e "s/'/ ' /g" -e 's/"//g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' \
        -e 's/,/ , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
        -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' | tr -s " "
}

I have to transform it into a JavaScript RegExp.

This is my partial implementation

        text=text.toLowerCase();
        text=text.replace(/(?:\\[rn]|[\r\n]+)+/g, " ");
        text=text.replace(/'/g, " ' ");
        text=text.replace(/"/g, '');
        text=text.replace(/\./g, ' \. ');
        text=text.replace(/,/g, ' \, ');
        text=text.replace(/\(/g, ' ( ');
        text=text.replace(/\)/g, ' ) ');
        text=text.replace(/!/g, ' ! ');
        text=text.replace(/\?/g, ' ? ');
        text=text.replace(/;/g, ' ');
        text=text.replace(/:/g, ' ');
        text=text.replace(/\t+/g,'\t').replace(/\t\s/g,' ').replace(/\t/g,' ');

Despite this implementation, when I use the JavaScript version to generate the file (using the FastCSV node library), it produces a bad CSV, which fails with a parsing error when read back, like

Error: Parse Error: expected: '"' got: 'i'. at 'i met her 

By contrast, when I normalize the file with sed and then read it with FastCSV, it works properly.

Upvotes: 1

Views: 1030

Answers (1)

Tamas Rev

Reputation: 7166

I think you can try the following code. Demo is here.

text = text.replace(/^/gm, '__label__');
text = text.replace(/"/g, '');
text = text.replace(/<br \/>/g, ' ');
text = text.replace(/([()!?.',])/g, ' $1 ');
text = text.replace(/[;:]/g, ' ');
text = text.replace(/ +/g, ' ');
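
If it helps, the replacements above can be wrapped in a single function. This is only a sketch: the name normalizeText is made up for illustration, and the toLowerCase() call is added on the assumption that you also want the effect of the first tr '[:upper:]' '[:lower:]' step in the bash version.

```javascript
// Hypothetical helper name; intended to mirror the bash normalize_text() pipeline.
function normalizeText(text) {
  return text
    .toLowerCase()                    // tr '[:upper:]' '[:lower:]'
    .replace(/^/gm, '__label__')      // prefix every line
    .replace(/"/g, '')                // drop double quotes
    .replace(/<br \/>/g, ' ')         // <br /> tags become a space
    .replace(/([()!?.',])/g, ' $1 ')  // pad these characters with spaces
    .replace(/[;:]/g, ' ')            // ; and : become a space
    .replace(/ +/g, ' ');             // squeeze runs of spaces, like tr -s " "
}

console.log(normalizeText('Hello, "World"!<br />How are you?'));
```

Like the sed pipeline, this can leave a single trailing space when the input ends with one of the padded characters.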

Explanation:

  • sed's 's/^/__label__/' adds '__label__' to the beginning of each line. In JS you need the multiline flag, m, for that; without it, ^ only matches at the start of the whole string.
  • eliminating the quotes is easy to translate from sed to js: -e 's/"//g' becomes text = text.replace(/"/g, '');
  • Replacing the <br /> line-break tags with a space is basically the same: -e 's/<br \/>/ /g' becomes text = text.replace(/<br \/>/g, ' ');.
  • You add spaces around several characters. I lumped them into a single replace: text = text.replace(/([()!?.',])/g, ' $1 ');.
    • You can specify multiple characters in a character class: [...]. It matches a single character, as long as that character is listed within the brackets. There are some tricks though with the ^ and the - characters - you can check them here.
    • This character class is within a capturing group: (...) so we can refer to it with $1 within the replacement.
  • You want to replace some characters with a space. I lumped them together like this: text = text.replace(/[;:]/g, ' ');.
  • I'm not familiar with the tr command. I believe in this case it replaces multiple spaces with one. You can do it with a regex like this: text = text.replace(/ +/g, ' ');.
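
To make the m flag and the $1 reference concrete, here is a small sketch. The sample strings and variable names are invented for illustration only.

```javascript
const sample = 'first line\nsecond line';

// Without the m flag, ^ only matches at the very start of the string.
const withoutM = sample.replace(/^/g, '__label__');

// With the m flag, ^ also matches right after every line break.
const withM = sample.replace(/^/gm, '__label__');

// $1 in the replacement re-inserts whatever the capturing group matched.
const padded = 'wait(what)?'.replace(/([()!?.',])/g, ' $1 ');

// JSON.stringify makes the line breaks and spaces visible in the output.
console.log(JSON.stringify(withoutM));
console.log(JSON.stringify(withM));
console.log(JSON.stringify(padded));
```

The padded result contains doubled spaces where two of the listed characters sit next to each other, which is why the final / +/g replacement is still needed.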

Upvotes: 1
