MysteryPancake

Reputation: 1505

JavaScript: Remove string punctuation and split into words?

I'm trying to get an array of words from a string like this:

"Exclamation! Question? \"Quotes.\" 'Apostrophe'. Wasn't. 'Couldn't'. \"Didn't\"."

The array is supposed to look like this:

[
  "exclamation",
  "question",
  "quotes",
  "apostrophe",
  "wasn't",
  "couldn't",
  "didn't"
]

Currently I'm using this expression:

sentence.toLowerCase().replace(/[^\w\s]/gi, "").split(" ");

The problem is, it removes apostrophes from words like "wasn't", turning it into "wasnt".

I can't figure out how to keep the apostrophes in words such as that.

Any help would be greatly appreciated!

var sentence = "Exclamation! Question? \"Quotes.\" 'Apostrophe'. Wasn't. 'Couldn't'. \"Didn't\".";
console.log(sentence.toLowerCase().replace(/[^\w\s]/gi, "").split(" "));

Upvotes: 3

Views: 5094

Answers (2)

revo

Reputation: 48711

Adapting your replace-based approach would be tricky. Instead, you can match the words directly and treat an apostrophe as part of a word only when it sits between word characters:

const sentence = `"Exclamation! Question? \"Quotes.\" 'Apostrophe'. Wasn't. 'Couldn't'. \"Didn't\"."`;
console.log(
    // Lowercase first, as in the question, then match each word.
    sentence.toLowerCase().match(/\w+(?:'\w+)*/g)
);

Note: the quantifier on the (?:'\w+) group was changed from ? to * so that a word may contain more than one apostrophe.
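
To illustrate the note above, here is a small sketch comparing the two quantifiers. The sample string is an illustrative input, not taken from the question, and the variable names are placeholders.

```javascript
// Same pattern with two different quantifiers on the apostrophe group.
const optionalOnce = /\w+(?:'\w+)?/g; // at most one apostrophe part per word
const repeated = /\w+(?:'\w+)*/g;     // any number of apostrophe parts

// Illustrative input (an assumption, not from the question).
const sample = "rock'n'roll wasn't";

const withOptional = sample.match(optionalOnce);
const withRepeated = sample.match(repeated);

console.log(withOptional); // [ "rock'n", "roll", "wasn't" ]
console.log(withRepeated); // [ "rock'n'roll", "wasn't" ]
```

With ?, a word containing two apostrophes is split after the first apostrophe part; with *, it stays in one piece.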

Upvotes: 4

Jeto

Reputation: 14927

@revo's answer looks good. Here's another option that should also work:

const input = "Exclamation! Question? \"Quotes.\" 'Apostrophe'. Wasn't. 'Couldn't'. \"Didn't\".";
console.log(input.toLowerCase().match(/\b[\w']+\b/g));

Explanation:

  • \b matches at a word boundary, i.e. the beginning or end of a word,
  • [\w']+ matches one or more characters that are letters, digits, underscores or apostrophes (to omit underscores, you can use [a-zA-Z0-9'] instead),
  • the g flag makes match return every occurrence of the pattern (not just the first one).
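
As a rough sketch of how the pieces above fit together, the pattern can be wrapped in a small helper. The function name extractWords and the allowUnderscore option are hypothetical, introduced only for illustration. One thing worth checking before relying on the underscore-free variant: \b still treats _ as a word character, so with [a-zA-Z0-9'] a token such as snake_case appears to be skipped entirely rather than split in two.

```javascript
// Hypothetical helper; the name and the option are illustrative assumptions.
function extractWords(text, { allowUnderscore = true } = {}) {
  const pattern = allowUnderscore
    ? /\b[\w']+\b/g          // letters, digits, underscores, apostrophes
    : /\b[a-zA-Z0-9']+\b/g;  // same, but without underscores
  // match() returns null when nothing matches, so fall back to [].
  return text.toLowerCase().match(pattern) || [];
}

// Illustrative input (an assumption, not from the question).
const text = "Wasn't. 'Couldn't'. snake_case";

console.log(extractWords(text));
// [ "wasn't", "couldn't", "snake_case" ]

console.log(extractWords(text, { allowUnderscore: false }));
// [ "wasn't", "couldn't" ]  (snake_case is dropped, not split)
```

The surrounding quotes are stripped because \b cannot match between a quote character and other punctuation, so a match has to begin and end on a letter, digit or underscore.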

Upvotes: 2
