Reputation: 16301
I have this text normalization function in bash
normalize_text() {
tr '[:upper:]' '[:lower:]' | sed -e 's/^/__label__/g' | \
sed -e "s/'/ ' /g" -e 's/"//g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' \
-e 's/,/ , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
-e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' | tr -s " "
}
I have to transform it into JavaScript RegExp replacements.
This is my partial implementation:
text=text.toLowerCase();
text=text.replace(/(?:\\[rn]|[\r\n]+)+/g, " ");
text=text.replace(/'/g, " ' ");
text=text.replace(/"/g, '');
text=text.replace(/\./g, ' \. ');
text=text.replace(/,/g, ' \, ');
text=text.replace(/\(/g, ' ( ');
text=text.replace(/\)/g, ' ) ');
text=text.replace(/!/g, ' ! ');
text=text.replace(/\?/g, ' ? ');
text=text.replace(/;/g, ' ');
text=text.replace(/:/g, ' ');
text=text.replace(/\t+/g,'\t').replace(/\t\s/g,' ').replace(/\t/g,' ');
Despite this implementation, when I use the JavaScript version to generate the file (using the FastCSV node library), it creates a malformed CSV, which causes a parsing error on reading, like
Error: Parse Error: expected: '"' got: 'i'. at 'i met her
When I normalize the file with sed and then read it with FastCSV, it works properly.
Upvotes: 1
Views: 1030
Reputation: 7166
I think you can try the following code. Demo is here.
text = text.replace(/^/gm, '__label__');
text = text.replace(/"/g, '');
text = text.replace(/<br \/>/g, ' ');
text = text.replace(/([()!?.',])/g, ' $1 ');
text = text.replace(/[;:]/g, ' ');
text = text.replace(/ +/g, ' ');
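As a rough sketch, the replacements above could be wrapped in one function. The name normalizeText is hypothetical, and the toLowerCase() call is an addition that is not in the snippet above; it is there to mirror the first tr call in the bash version.

```javascript
// Sketch of a JavaScript port of the bash normalize_text pipeline.
// normalizeText is a hypothetical name; toLowerCase() is added here to
// mirror tr '[:upper:]' '[:lower:]' from the bash function.
function normalizeText(text) {
  return text
    .toLowerCase()                     // tr '[:upper:]' '[:lower:]'
    .replace(/^/gm, '__label__')       // s/^/__label__/ on every line
    .replace(/"/g, '')                 // s/"//g
    .replace(/<br \/>/g, ' ')          // s/<br \/>/ /g
    .replace(/([()!?.',])/g, ' $1 ')   // surround punctuation with spaces
    .replace(/[;:]/g, ' ')             // s/;/ /g and s/:/ /g
    .replace(/ +/g, ' ');              // tr -s " "
}

const sample = 'Hello, World! "Quoted" text: it\'s (fine).<br />Done?';
console.log(normalizeText(sample));
```

Like the bash pipeline, this leaves a trailing space when the input ends with a padded punctuation character; add .trim() if that matters for the consumer.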
Explanation:
's/^/__label__/' adds '__label__' to the beginning of each line. In JS you need the multiline modifier, /m, for that.
-e 's/"//g' becomes text = text.replace(/"/g, '');
-e 's/<br \/>/ /g' becomes text = text.replace(/<br \/>/g, ' ');
The rules that surround a punctuation character with spaces are combined into one: text = text.replace(/([()!?.',])/g, ' $1 ');
This uses a character class, [...]. It matches one character if that character is listed within the brackets. There are some tricks with the ^ and the - characters, though; you can check them here.
The parentheses, (...), capture the matched character so we can refer to it with $1 within the replacement.
The ; and : rules are combined the same way: text = text.replace(/[;:]/g, ' ');
The last step is the tr command. I believe in this case it replaces multiple spaces with one. You can do it with a regex like this: text = text.replace(/ +/g, ' ');
Upvotes: 1
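A small illustration of two points from the explanation, the /m modifier and the special meaning of ^ and - inside brackets. The variable names and sample strings are made up for the demo.

```javascript
// Without /m, ^ only matches at the very start of the string.
const lines = 'first\nsecond';
const withoutM = lines.replace(/^/g, '> ');   // prefixes only the first line
const withM = lines.replace(/^/gm, '> ');     // prefixes every line

// Inside [...], a leading ^ negates the class and a - between two
// characters forms a range. Escape ^ and put - last to match them literally.
const literal = 'a^b-cd'.replace(/[\^-]/g, '_');   // replaces the ^ and the -
const negated = 'a^b-cd'.replace(/[^a-c]/g, '_');  // replaces anything except a, b, c

console.log(withoutM);
console.log(withM);
console.log(literal);
console.log(negated);
```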