Reputation: 531
I'd like to split up a string like:
Input: Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.
Output:
Bangalore
railway
line
Indian Railway
comes
under
Nagpur
division
Central Railway
Notice that compound nouns would be kept together because they are Title Case.
I'm having trouble with the regex part specifically: split(/(?=\s[a-z]|[A-Z]\s|\.)/)
How do I get it to split at the marked position in the 'water ꜜ Tor Museum' scenario, i.e. between a lowercase word and the Title Case phrase that follows it?
export function splitByPhrase(text: string) {
  const outputFreq = text
    .split(/(?=\s[a-z]|[A-Z]\s|\.)/)
    .filter(Boolean)
    .map((x) => x.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, "").trim())
    .filter((x) => !stopWords.includes(x));
  return outputFreq;
}
describe("phrases", () => {
  it("no punctuation", () => {
    expect(splitByPhrase("test. Toronto")).toEqual(["test", "Toronto"]);
  });
  it("no spaces", () => {
    expect(splitByPhrase(" test Toronto ")).toEqual(["test", "Toronto"]);
  });
  it("simple phrase detection", () => {
    expect(splitByPhrase(" water Tor Museum wants")).toEqual(["water", "Tor Museum", "wants"]);
  });
  it("remove stop words", () => {
    expect(splitByPhrase("Toronto a Museum with")).toEqual(["Toronto", "Museum"]);
  });
});
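For reference, here is a minimal sketch of what the current pattern produces on the failing case. It only reuses the regex from the function above; the variable names are made up for illustration. None of the three alternatives matches at the space before "Tor" (the next characters are a space and an uppercase letter), so no split happens there.

```javascript
// The split pattern from the question, unchanged.
const original = /(?=\s[a-z]|[A-Z]\s|\.)/;

// Split the failing input and trim each piece, as the function does.
const parts = " water Tor Museum wants"
  .split(original)
  .map((s) => s.trim());

console.log(parts); // [ 'water Tor Museum', 'wants' ]
```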
Upvotes: 2
Views: 58
Reputation: 163277
You might add another alternative that splits at a space only when what is on the left is not a Title Case word (an uppercase char followed by optional lowercase chars) and what is on the right is an uppercase char.
(?= [a-z]|\.|(?<!\b[A-Z][a-z]*) (?=[A-Z]))
const stopWords = [
  "of", "The", "It", "the", "a", "with"
];
function splitByPhrase(text) {
  return text
    .split(/(?= [a-z]|\.|(?<!\b[A-Z][a-z]*) (?=[A-Z]))/)
    .map((x) => x.replace(/[.,\/#!$%^&*;:{}=_`~()-]/g, "").trim())
    .filter((x) => !stopWords.includes(x))
    .filter(Boolean);
}
[
  "Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.",
  "test. Toronto",
  " test Toronto ",
  " water Tor Museum wants",
  "Toronto a Museum with"
].forEach(i => console.log(splitByPhrase(i)));
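To check the function against the expectations from the question without a test runner, something like the following sketch could be used. The function and stop-word list are copied from above; the `cases` table and the `results` name are just illustrative, and the expected arrays are taken from the question's test suite. Note that the pattern relies on lookbehind, which needs an engine with ES2018 regex support.

```javascript
// Stop-word list and function copied from the answer above.
const stopWords = ["of", "The", "It", "the", "a", "with"];

function splitByPhrase(text) {
  return text
    .split(/(?= [a-z]|\.|(?<!\b[A-Z][a-z]*) (?=[A-Z]))/)
    .map((x) => x.replace(/[.,\/#!$%^&*;:{}=_`~()-]/g, "").trim())
    .filter((x) => !stopWords.includes(x))
    .filter(Boolean);
}

// [input, expected] pairs taken from the question's tests.
const cases = [
  ["test. Toronto", ["test", "Toronto"]],
  [" test Toronto ", ["test", "Toronto"]],
  [" water Tor Museum wants", ["water", "Tor Museum", "wants"]],
  ["Toronto a Museum with", ["Toronto", "Museum"]],
];

// Compare actual and expected output by their JSON representation.
const results = cases.map(([input, expected]) => ({
  input,
  pass: JSON.stringify(splitByPhrase(input)) === JSON.stringify(expected),
}));

console.log(results);
```

Each entry in `results` should report `pass: true` if the pattern behaves as described.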
Upvotes: 1
Reputation: 15
For the case of splitting a lowercase word before a Title Case word, I think split(/\s(?=[a-z]|[A-Z]\w+ |\.)/) works for what you want.
Input: Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.
Output:
Bangalore
railway
line
of
the
Indian Railway.
It
comes
under
Nagpur
division
of
the
Central Railway.
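A small sketch to reproduce the output above, assuming the pattern is passed to split as a regex literal (the variable names are only for illustration). Punctuation and stop words are still present at this stage, so they would have to be removed in a later step. The second call shows how the same pattern treats two adjacent Title Case words that are each followed by a space.

```javascript
// The split pattern from this answer, written as a regex literal.
const pattern = /\s(?=[a-z]|[A-Z]\w+ |\.)/;

const text =
  "Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.";

// Splits on whitespace; "Indian Railway." stays together because
// "Railway" is followed by a period rather than a space.
const tokens = text.split(pattern);
console.log(tokens);

// Here both "Tor" and "Museum" are followed by a space, so the
// whitespace before each of them is a split point.
const adjacent = " water Tor Museum wants".split(pattern);
console.log(adjacent); // [ '', 'water', 'Tor', 'Museum', 'wants' ]
```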
Upvotes: 1