jaksco

Reputation: 531

Simple phrase detection, split by phrase regex

I'd like to split up a string like:

Input: Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.

Output:

Bangalore 
railway 
line
Indian Railway 
comes
under 
Nagpur 
division
Central Railway

Notice that compound nouns would be kept together because they are Title Case.

I'm having trouble with the regex part specifically: split(/(?=\s[a-z]|[A-Z]\s|\.)/)

How do I get it to split at the marked position in the 'water ꜜ Tor Museum' scenario?

export function splitByPhrase(text: string) {
  const outputFreq = text
    .split(/(?=\s[a-z]|[A-Z]\s|\.)/)
    .filter(Boolean)
    .map((x) => x.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g, "").trim())
    .filter((x) => !stopWords.includes(x));

  return outputFreq;
}

describe("phrases", () => {
  it("no punctuation", () => {
    expect(splitByPhrase("test. Toronto")).toEqual(["test", "Toronto"]);
  });
  it("no spaces", () => {
    expect(splitByPhrase(" test Toronto ")).toEqual(["test", "Toronto"]);
  });
  it("simple phrase detection", () => {
    expect(splitByPhrase(" water Tor Museum wants")).toEqual(["water", "Tor Museum", "wants"]);
  });
  it("remove stop words", () => {
    expect(splitByPhrase("Toronto a Museum with")).toEqual(["Toronto", "Museum"]);
  });
});

Upvotes: 2

Views: 58

Answers (2)

The fourth bird

Reputation: 163277

You could add another alternative that splits at a space only when what is on the left is not a Title Case word (an uppercase char followed by optional lowercase chars) and what is on the right is an uppercase char.

(?= [a-z]|\.|(?<!\b[A-Z][a-z]*) (?=[A-Z]))


const stopWords = [
  "of", "The", "It", "the", "a", "with"
];

function splitByPhrase(text) {
  return text
    .split(/(?= [a-z]|\.|(?<!\b[A-Z][a-z]*) (?=[A-Z]))/)
    .map((x) => x.replace(/[.,\/#!$%^&*;:{}=_`~()-]/g, "").trim())
    .filter((x) => !stopWords.includes(x)).filter(Boolean);
}

[
  "Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.",
  "test. Toronto",
  " test Toronto ",
  " water Tor Museum wants",
  "Toronto a Museum with"
].forEach(i => console.log(splitByPhrase(i)));
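
To make the difference concrete, here is a minimal sketch comparing the pattern from the question with the one above on the failing input. The `pieces` helper is hypothetical and exists only for this comparison: it trims and drops empty strings but skips the punctuation and stop-word steps. Note that the lookbehind `(?<!...)` needs an engine with ES2018 regex support.

```javascript
// Pattern from the question: nothing in it matches between "water" and "Tor".
const original = /(?=\s[a-z]|[A-Z]\s|\.)/;
// Pattern from this answer: the third alternative adds the lookbehind check.
const revised = /(?= [a-z]|\.|(?<!\b[A-Z][a-z]*) (?=[A-Z]))/;

// Hypothetical helper for this sketch: split, trim, drop empty pieces.
const pieces = (text, pattern) =>
  text.split(pattern).map((s) => s.trim()).filter(Boolean);

const sample = " water Tor Museum wants";
const before = pieces(sample, original);
const after = pieces(sample, revised);

console.log(before); // [ 'water Tor Museum', 'wants' ]
console.log(after);  // [ 'water', 'Tor Museum', 'wants' ]
```

The space before "Tor" is a split point because "water" is not a Title Case word, while the space before "Museum" is not, because "Tor" is.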

Upvotes: 1

Joa

Reputation: 15

For the case of splitting a lowercase word before a Title Case word, I think split(/\s(?=[a-z]|[A-Z]\w+ |\.)/) works for what you want.

https://regexr.com/59jfo

Input: Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.

Output:

Bangalore
railway
line
of
the
Indian Railway.
It
comes
under
Nagpur
division
of
the
Central Railway.
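
A quick sketch of how this pattern behaves, in case it helps. The variable names are only for illustration. Because the whitespace is consumed by the split rather than just asserted, no trimming is needed, but punctuation and stop words are still present in the result.

```javascript
// Pattern from this answer: consumes one whitespace char, then looks ahead.
const pattern = /\s(?=[a-z]|[A-Z]\w+ |\.)/;

const sentence =
  "Bangalore railway line of the Indian Railway. It comes under Nagpur division of the Central Railway.";
const sentenceParts = sentence.split(pattern);
console.log(sentenceParts); // 14 items, including 'Indian Railway.' and 'Central Railway.'

// A Title Case run followed by another word, as in the question's test.
// filter(Boolean) drops the empty string produced by the leading space.
const phraseParts = " water Tor Museum wants".split(pattern).filter(Boolean);
console.log(phraseParts); // [ 'water', 'Tor', 'Museum', 'wants' ]
```

As far as I can tell, a Title Case run stays together here only when its last word is followed by punctuation or the end of the string. When another word follows, as in "Tor Museum wants", each word is split off separately.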

Upvotes: 1
