Sasha Grievus

Reputation: 2686

Matching words as separate strings unless they start with a capital letter

I have this regexp

/[A-Za-zÀ-ÿ]+/g

that matches 'words' made up of these characters, of unlimited length.

What if I want to exclude words starting with a capital letter?

I tried

/(^[A-Z])[A-Za-zÀ-ÿ]+/g

but it doesn't seem to work. I can't use things like \w because it doesn't include diacritics.

EDIT: The language in use is TypeScript, so the regex engine is the JavaScript one (which doesn't allow lookbehind, for example). Sorry for not mentioning this.

EDIT: The input given can be something like

"foo"            //should match foo and return true
"Foo"            //should not match Foo and return false
"fòo"            //should match fòo and return true
" "              //should not match and return false
"."              //should not match and return false
","              //should not match and return false

Code (TypeScript) that matches, without handling the capital letter yet

function isProperWord(word: string): boolean {
    /* should reject
      - strings that are not words (symbols, spaces, etc.)
      - names (words starting with a capital letter)
    */
    if (word.match(/[A-Za-zÀ-ÿ]+/g)) {
      return true;
    } else {
      return false;
    }
}

Upvotes: 0

Views: 179

Answers (2)

Wiktor Stribiżew

Reputation: 627419

To match all capital letters from your initial range, you may use the [A-ZÀ-ÖØ-Þ] character class. To match all lowercase letters, use [a-zß-öø-ÿ]. Note that × and ÷ are not letters, so I removed them from these classes.

To make sure the whole string consists of these letters only, and the first char is not an uppercase letter, use

/^[a-zß-öø-ÿ][A-Za-zÀ-ÖØ-öø-ÿ]*$/

See the regex demo.

JS demo:

var strs = ['foo','fòo','Foo',' ','.',','];
var rx = /^[a-zß-öø-ÿ][A-Za-zÀ-ÖØ-öø-ÿ]*$/;
for (var s of strs) {
  console.log(s,"=>",rx.test(s));
}
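
For the TypeScript function in the question, the same anchored pattern can be dropped in directly. This is only a minimal sketch: the constant name PROPER_WORD is made up for illustration, and the normalize("NFC") call is an added assumption that the input may contain decomposed accents (a base letter followed by a combining mark), which the character classes above would otherwise reject.

```typescript
// Anchored pattern from the answer: first char lowercase, rest any letter in the range.
// No /g flag, so test() keeps no state between calls.
const PROPER_WORD = /^[a-zß-öø-ÿ][A-Za-zÀ-ÖØ-öø-ÿ]*$/;

function isProperWord(word: string): boolean {
  // Assumption: compose "o" + combining grave into a single "ò" before testing.
  return PROPER_WORD.test(word.normalize("NFC"));
}

for (const s of ["foo", "fòo", "Foo", " ", ".", ","]) {
  console.log(JSON.stringify(s), "=>", isProperWord(s));
}
```

Because the pattern is anchored with ^ and $, a string such as "foo bar" is rejected as a whole. Use the extraction approach if the input can hold several words.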

To extract words, use custom boundaries:

var s = 'foo,fòo,Foo';
var rx = /(?:[^A-Za-zÀ-ÖØ-öø-ÿ]|^)([a-zß-öø-ÿ][A-Za-zÀ-ÖØ-öø-ÿ]*)(?![A-Za-zÀ-ÖØ-öø-ÿ])/g;
var m, res=[];
while(m=rx.exec(s)) {
  res.push(m[1]);
}
console.log(res);

Upvotes: 1

tripleee

Reputation: 189830

The expression ^[A-Z] means "match an uppercase character at the beginning of the line". You probably meant to type [^A-Z], which matches a character that is not an uppercase letter between A and Z, but that still doesn't help, because the regex engine will find a character somewhere that matches this, and be satisfied. (For example, a space trivially matches this: it's a character, and it's not in the range A through Z.)
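
To make the difference concrete, here is a small sketch in TypeScript (the language named in the question). The variable names are only illustrative. The second pattern is a guess at what the negated version would look like, not something taken from the question.

```typescript
// The pattern from the question: REQUIRES an uppercase letter at the start.
const anchoredUpper = /(^[A-Z])[A-Za-zÀ-ÿ]+/;
// Hypothetical negated variant: any non-uppercase character, then letters.
const negatedUpper = /[^A-Z][A-Za-zÀ-ÿ]+/;

console.log(anchoredUpper.test("Foo")); // true  (the opposite of what was wanted)
console.log(anchoredUpper.test("foo")); // false

// The negated class does not reject "Foo": the engine skips "F" and matches "oo".
console.log("Foo".match(negatedUpper)?.[0]);  // "oo"
// A leading space satisfies [^A-Z] as well, so it becomes part of the match.
console.log(" foo".match(negatedUpper)?.[0]); // " foo"
```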

If you use a regex dialect which understands word boundaries with \b, try

/\b[a-z][A-Za-z]*/

to match a token which has a word boundary on its left, and a lowercase character adjacent to it. (I am ignoring your locale extension, which is not portable and possibly not well-defined.)
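
Since the question turned out to be about JavaScript: there, \b is defined in terms of the ASCII word characters [A-Za-z0-9_] (unless both the u and i flags are set), so an accented letter counts as a non-word character and creates a boundary in the middle of a word. A quick sketch of what that means for the pattern above (the variable name is illustrative):

```typescript
const wordStart = /\b[a-z][A-Za-z]*/;

console.log("foo bar".match(wordStart)?.[0]); // "foo"
console.log("Foo".match(wordStart));          // null, as intended

// "ò" is not an ASCII letter, so the match stops right before it...
console.log("fòo".match(wordStart)?.[0]);     // "f"
// ...and \b also sees a boundary between "ò" and the final "o".
console.log("Fòo".match(wordStart)?.[0]);     // "o"
```

So with accented input, \b alone is not enough in JavaScript; the boundary has to be spelled out with explicit character classes.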

In isolation, the /g flag doesn't do anything. If you have a language which supports it, and use a regex in a while loop or similar, it will cause the engine to return all the matches in the string, one at a time, inside the loop; but without further context, we have no idea whether that is actually true here.
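
In JavaScript specifically (and therefore TypeScript), the flag does have visible effects that matter for a yes/no check. The sketch below uses made-up sample strings. It shows that String.prototype.match returns all matches with /g, and that RegExp.prototype.test on a /g regex remembers its position in lastIndex between calls.

```typescript
const text = "foo bar";

// Without /g: only the first match, plus details such as its index.
const first = text.match(/[a-z]+/);
console.log(first?.[0], first?.index); // "foo" 0

// With /g: an array of every match, no capture details.
const all = text.match(/[a-z]+/g);
console.log(all); // ["foo", "bar"]

// test() with /g is stateful: the second call resumes at lastIndex and fails.
const rx = /[a-z]+/g;
const firstCall = rx.test("foo");  // true, lastIndex is now 3
const secondCall = rx.test("foo"); // false, lastIndex is reset to 0
console.log(firstCall, secondCall);
```

For a function that only needs to return true or false, leaving /g off avoids the lastIndex surprise.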

Upvotes: 3
