vonjd

Reputation: 4380

Regular expression in R behaves differently than in other languages

The regular expression pattern ^[A-Z]{2,4}$ specifies that the whole string, from start (^) to end ($), must consist of uppercase letters only, and that there must be two, three, or four of them. Anything else is not considered valid:

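filter_symbols <- function(symbols) {
  # regexpr() returns the match position: 1 for a match at the first character, -1 for no match
  valid <- regexpr("^[A-Z]{2,4}$", symbols)
  # keep only the matching symbols, sorted alphabetically
  sort(symbols[valid == 1])
}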
filter_symbols(c("MOT", "CVX", "123", "GOG2", "XLE", "AAPL", "AAPLS", "A"))

...and it works like a charm:

[1] "AAPL" "CVX"  "MOT"  "XLE" 

Now when you test the same pattern in Debuggex (and there are many similar online regex testers out there):

^[A-Z]{2,4}$

Debuggex Demo

...you don't get any match (not even when you put each word on its own line). Why does it behave differently in the two cases?

Upvotes: 1

Views: 88

Answers (2)

hwnd

Reputation: 70722

In Debuggex, you get no match because the correct modifier is not turned on.

In almost all regular expression engines, the anchors ^ and $ by default match only at the beginning and the end of the string, respectively. If you want them to match at the beginning and end of each line (not only of the whole string), turn on the m (multi-line) modifier, which enables this behavior.

You can see the difference with this modifier turned on: Debuggex Demo
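
The same effect can be reproduced in R itself. The sketch below is only an illustration: it assumes the tester input is one newline-separated string (tester_input is a made-up name) and uses perl = TRUE so that the inline (?m) modifier is available.

```r
# Hypothetical stand-in for the tester's textarea: one string, one word per line
tester_input <- "MOT\nCVX\n123\nGOG2\nXLE\nAAPL\nAAPLS\nA"

# Without the modifier, ^ and $ anchor to the whole string
no_flag <- grepl("^[A-Z]{2,4}$", tester_input, perl = TRUE)

# With the inline (?m) modifier, ^ and $ anchor to each line
pattern_m <- "(?m)^[A-Z]{2,4}$"
with_flag <- regmatches(tester_input,
                        gregexpr(pattern_m, tester_input, perl = TRUE))[[1]]

print(no_flag)    # whether the whole text matched
print(with_flag)  # the individual lines that matched
```

Here no_flag should come out FALSE, while with_flag should hold the four valid symbols in input order.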

Upvotes: 2

Lucas Trzesniewski

Reputation: 51330

By default, ^ matches at the start of the string, and $ matches only at the end.

Debuggex and other similar sites pass the whole input textarea as a single input string, so your regex was actually being matched against MOT\ncvx\n123...AAPL.

Enable the m (multiline) flag. In this mode, ^ and $ match at the start and end of each line, which lets you test multiple inputs at once.

See the updated debuggex demo
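
This is also why the R code in the question works: the regex functions are vectorised, so each element of the character vector is matched as its own string. A rough sketch of the difference follows. The variable names are illustrative and not from the question, and perl = TRUE is an assumption made here so that the anchor behaviour is the PCRE one.

```r
symbols <- c("MOT", "CVX", "123", "GOG2", "XLE", "AAPL", "AAPLS", "A")
pattern <- "^[A-Z]{2,4}$"

# One string, as an online tester sees it: the anchors refer to the whole text
as_textarea <- paste(symbols, collapse = "\n")
whole_match <- grepl(pattern, as_textarea, perl = TRUE)

# One string per line, as R sees the vector: each element is anchored on its own
per_line <- strsplit(as_textarea, "\n", fixed = TRUE)[[1]]
line_matches <- per_line[grepl(pattern, per_line, perl = TRUE)]

print(whole_match)
print(sort(line_matches))
```

Sorting line_matches should give the same four symbols that filter_symbols() returns in the question.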

Upvotes: 2
