Reputation: 11
I'm currently using regular expressions to manipulate street names in Stata. I'm faced with a problem that requires me to select observations based on how long a certain word in the string is. I know that other engines let you specify how many times an expression repeats using curly braces, but this doesn't seem to work in Stata. Specifically, I want to select observations that have three or more alphanumeric characters at a certain point in the string, which should be coded as
[a-zA-Z0-9]{3,}
However, this doesn't work when I try it in Stata, nor do any other uses of {} work, even though online debuggers say it should be correct. Is this a deficiency in the Stata implementation of regex? I'm working on a solution that doesn't need that iteration, but I'd like to hear from the community on what is lacking in regex in Stata, and if there's a different way to iterate expressions in the program.
Upvotes: 1
Views: 336
Reputation: 9460
I think the new Unicode regexp parser in Stata 14 (based on the ICU library) can use this notation to find patterns that repeat at least k times:
clear
input str50 address
"221B Baker Street"
"56B, Whitehaven Mansions"
"Danemead, High street, St. Mary Mead"
end
compress
list address if ustrregexm(address,"([0-9]){3,}")
This lists only Sherlock's address, since it is the only one with three or more consecutive digits. It also looks like you can use character classes:
list address if ustrregexm(address,"([:digit:]){3,}")
The regular (non-Unicode) regexp parser behind regexm() has never supported this counting shorthand.
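To tie this back to the alphanumeric pattern in the question, here is a minimal sketch that reuses the sample addresses above. It assumes Stata 14 or later (for ustrregexm()); the variable name long_first and the threshold of four characters are only illustrative choices, not anything from the question.

```stata
* Sketch only: assumes Stata 14+, where ustrregexm() is available
clear
input str50 address
"221B Baker Street"
"56B, Whitehaven Mansions"
"Danemead, High street, St. Mary Mead"
end

* long_first is a hypothetical name: 1 if the string starts with a run of
* four or more alphanumeric characters, 0 otherwise
generate byte long_first = ustrregexm(address, "^[a-zA-Z0-9]{4,}")
list address long_first
```

With these three addresses, "56B," should be the only one not flagged, because its leading run stops at three characters. Dropping the ^ anchor would test for such a run anywhere in the string instead.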
Upvotes: 1
Reputation: 626870
According to the documentation, Stata's own regex parser has no limiting (counting) quantifier:
Other popular regular-expression syntaxes include the POSIX standard and Perl’s standard. Both expand on these basic operators by including counting operators (use of curly braces), metacharacters (usually of the form :alpha:, etc.), and other syntax-specific additions.
When presented with the choice of which regular-expression syntax to adopt, Stata has several options. Different operating systems offer their own regular-expression parsers for applications to use, but there is no guarantee that these parsers are consistent. Stata avoids this ambiguity by using its own parser.
You just need to repeat the subpattern "manually" (as some examples on the documentation web page do). For three or more alphanumeric characters, that is:
[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]+
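As a rough illustration of how the manual repetition can be used with regexm(), here is a small sketch. The sample addresses and the variable name has3 are made up for the example, and it uses digits rather than the full alphanumeric class so that the rows give different results.

```stata
* Sketch only: toy data, not the asker's actual street names
clear
input str50 address
"221B Baker Street"
"56B, Whitehaven Mansions"
"Danemead, High street, St. Mary Mead"
end

* has3 is a hypothetical name: 1 if the string contains three or more
* digits in a row. Two fixed copies plus a third with + stand in for {3,}
generate byte has3 = regexm(address, "[0-9][0-9][0-9]+")
list address if has3
```

Only the first address has three digits in a row, so it should be the only one listed. The same trick works for any fixed minimum: write the subpattern out that many times and put + on the last copy.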
Upvotes: 0