Egor Koshelko

Reputation: 227

Split text into words ignoring the single quote

I am trying to use a JavaScript regular expression to extract the words from a text, but contractions should be treated as single words: "can't" should stay "can't", not become "can" and "t".

I tried this:

var text = "I'd like to make it work."
var words = text.match(/\w+/g);

But it doesn't handle "I'd" properly: \w does not match the apostrophe, so the result contains "I" and "d" as separate words.

How can I make it treat a word containing a single quote as one word rather than two?

Upvotes: 0

Views: 1539

Answers (4)

Yash Patil

Reputation: 71

Try the following regex:

/[\w']*[^\d\W]/g
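As a rough illustration of how this pattern behaves (the second sample string is my own, not from the question): [\w']* greedily takes word characters and apostrophes, and the final [^\d\W] forces each match to end on a letter or underscore. One consequence is that tokens made only of digits are skipped, and a trailing apostrophe is dropped.

```javascript
// Sketch: the pattern from this answer applied to two sample strings.
const wordPattern = /[\w']*[^\d\W]/g;

// The sentence from the question: the contraction stays in one piece.
const contraction = "I'd like to make it work.".match(wordPattern);
console.log(contraction);

// Assumed extra sample: a digits-only token cannot end on a letter,
// so "2014" produces no match at all.
const withDigits = "It's 2014 now".match(wordPattern);
console.log(withDigits);
```

Whether dropping pure numbers is desirable depends on what you count as a word.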

Upvotes: 0

Joseph Myers

Reputation: 6552

If you want to match domains and other word-like objects that are dot rather than hyphen delimited, you can modify @hwnd's solution as follows:

text.match(/[^*"\s?!\(\)]*[^*"\s?!.,\(\)]/g);

Periods (e.g., at the end of a sentence) won't be included in words, but words such as domains like stackoverflow.com that contain dots within them will be returned as a single word.
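A small sketch of that behaviour, using a sample sentence of my own: the first character class may consume dots and commas inside a token, while the second class refuses to end a match on one, so only trailing punctuation is trimmed.

```javascript
// Sketch: the dot-tolerant pattern from this answer on an assumed sample.
const dottedWordPattern = /[^*"\s?!\(\)]*[^*"\s?!.,\(\)]/g;

const dottedSample = 'Visit stackoverflow.com today, or "ask" here.';
const dottedWords = dottedSample.match(dottedWordPattern);

// The domain stays whole; the comma, the double quotes, and the final
// period are not part of any returned word.
console.log(dottedWords);
```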

Double quotes are automatically ignored. Single quotes could also be ignored, but only by losing the ability to recognize words like 'Tis (as in 'Tis so sweet to trust in Jesus...) or possessives like students'. Perfectly parsing all words requires a bit of comprehension beyond a regular expression's capabilities, but either one of these solutions will do the job rather well in most cases.

The following regular expression works even better for English. However, \w in JavaScript is not locale-aware (it is equivalent to [A-Za-z0-9_]), so I would be careful about using it in any potentially internationalized context.

/[^\s!"<>\(\)\[\]\{\}?`]*[\w']/g

(For example, it recognizes every word in this answer properly, except for "e.g." on which it mistakenly thinks the trailing . is a period and ignores it.)

This final regular expression does not rely on \w, so it works just as well for text in any language:

/[^\s!"<>\(\)\[\]\{\}?`]*[^\s!"<>\(\)\[\]\{\}?`.,:]/g
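To make the difference between the last two patterns concrete, here is a hedged comparison on a French sample sentence of my own choosing. The variable names are illustrative only. The \w-based pattern must end each match on an ASCII word character or apostrophe, so a word ending in an accented letter gets cut short; the final pattern has no such requirement.

```javascript
// Sketch: comparing the \w-based pattern with the language-neutral one.
const englishPattern = /[^\s!"<>\(\)\[\]\{\}?`]*[\w']/g;
const neutralPattern = /[^\s!"<>\(\)\[\]\{\}?`]*[^\s!"<>\(\)\[\]\{\}?`.,:]/g;

// Assumed sample: "Un café, s'il vous plaît." written with escapes so the
// accented letters are unambiguous precomposed characters.
const frenchSample = "Un caf\u00e9, s'il vous pla\u00eet.";

const englishResult = frenchSample.match(englishPattern);
const neutralResult = frenchSample.match(neutralPattern);

// englishResult truncates "café" to "caf" because "é" is not matched by \w;
// "plaît" survives only because it happens to end in an ASCII letter.
console.log(englishResult);
console.log(neutralResult);
```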

Upvotes: 0

hwnd

Reputation: 70732

Another way to do this is a negated match: put the characters you don't want to match inside a character class. A caret ^ at the start of a character class [] is the negation operator.

var text = "I'd like to make it work.";
var words = text.match(/[^\s?!.]+/g);
console.log(words); // => [ "I'd", 'like', 'to', 'make', 'it', 'work' ]

Regular expression:

[^\s?!.]+     any character except: whitespace (\n, \r, \t, \f, and " "), 
              '?', '!', '.' (1 or more times)
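Because the class only excludes whitespace, '?', '!' and '.', any other punctuation stays attached to the word. A minimal sketch, assuming you also want to drop commas, semicolons, and colons (that extended class is my own variation, not part of the answer):

```javascript
// Sketch: the negated class from this answer, and an assumed extension.
const punctuated = "Well, I can't; can you?";

// Original class: the comma and semicolon remain glued to their words.
const originalClass = punctuated.match(/[^\s?!.]+/g);
console.log(originalClass);

// Extended class (assumption): also exclude ',', ';' and ':'.
const extendedClass = punctuated.match(/[^\s?!.,;:]+/g);
console.log(extendedClass);
```

The same idea extends to any other delimiter: add it to the negated class.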

Upvotes: 1

melvas

Reputation: 2356

var text = "I'd like to make it work."
var words = text.split(' ');

returns ["I'd", "like", "to", "make", "it", "work."]

EDITED

I'm sorry, ChiChou was right in his comment:

var words = text.match(/[A-Za-z0-9_']+/g);

It works as expected.
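One caveat, sketched below with a sample sentence of my own: since the apostrophe is allowed anywhere in the match, single quotes used as quotation marks are kept as part of the word. The second pattern is an alternative that is not from this thread; it only accepts an apostrophe when it sits between word characters.

```javascript
// Sketch: apostrophe-anywhere versus apostrophe-only-inside-a-word.
const quotedSample = "She said 'hello' and I can't.";

// Character class from this answer: the quotation marks stay attached.
const classResult = quotedSample.match(/[A-Za-z0-9_']+/g);
console.log(classResult);

// Alternative (assumption, not from the thread): word characters,
// optionally followed by groups of an apostrophe plus more word characters.
const innerApostrophe = quotedSample.match(/\w+(?:'\w+)*/g);
console.log(innerApostrophe);
```

The trade-off is that the alternative also drops leading or trailing apostrophes that belong to the word, as in possessives like students'.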

Upvotes: 0
