Timothy Alston

Reputation: 1549

extracting regular expressions (regexs) in Stata

I am using Stata, and have a variable called "practice" which holds a list of practice names, each followed by its 5-character code in parentheses.

I want to extract only the code part into a new variable. Here is an example of what the data in the variable "practice" looks like:

practice 1 name (JRX76)
practice 2 name but longer (XN6S1)
practice 3 name (4NB87)
practice 4 name but longer (north) (RS236)
practice 5 name (WSZ92)

I have used the following code so far:

gen code=regexs(2) if regexm(practice, "(\()+([a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9])")

This works perfectly, except on data in the format of "practice 4" above, for which it extracts "north" rather than "RS236".

I have tried playing around with the $ symbol, but without success.

I have also not worked out how to combine 'if' statements with regexs, along the lines of the logic "if you find 2 '(', take the 5 character expression after the second '('".

Would anyone be able to point me in the right direction on this please?

Upvotes: 0

Views: 1714

Answers (2)

Aaron

Reputation: 24812

I'd guess you forgot to take the trailing parenthesis into account when you tried to add the "end-of-string" $ symbol. To keep it as close to your current regex as possible, I would suggest this one:

(\()+([a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9])(\))+$
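As a quick sanity check of the pattern logic, here is a minimal sketch that runs the original and the anchored pattern over the sample data from the question. It uses Python's re module rather than Stata's regexm, so it only illustrates how the patterns behave; the helper name extract is made up for this example.

```python
import re

# Sample rows copied from the question.
practices = [
    "practice 1 name (JRX76)",
    "practice 2 name but longer (XN6S1)",
    "practice 3 name (4NB87)",
    "practice 4 name but longer (north) (RS236)",
    "practice 5 name (WSZ92)",
]

# Five alphanumeric characters, spelled out as in the question.
ALNUM5 = "[a-zA-Z0-9]" * 5

# Pattern from the question: no end-of-string anchor.
original = re.compile(r"(\()+(" + ALNUM5 + ")")
# Same pattern with the closing parenthesis and $ appended.
anchored = re.compile(r"(\()+(" + ALNUM5 + r")(\))+$")


def extract(pattern, text):
    """Return the second capture group (hypothetical helper), or None."""
    m = pattern.search(text)
    return m.group(2) if m else None


original_codes = [extract(original, p) for p in practices]
anchored_codes = [extract(anchored, p) for p in practices]

print(original_codes[3])  # the unanchored pattern stops at "(north"
print(anchored_codes[3])  # the anchored pattern reaches the final code
```

The unanchored pattern matches at the first "(" followed by five alphanumerics, which is why "(north" wins on the fourth row; requiring the closing parenthesis right before the end of the string rules that match out.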

Now there are a few improvements I would suggest:

  • there's no need to use an "at least one time" + quantifier on the parentheses if they occur exactly once
  • there's no need to add a group around the parentheses
  • if Stata supports lookarounds, they could simplify your code
  • don't repeat yourself : use quantifiers

So you could try using this one with lookarounds:

(?<=\()[a-zA-Z0-9]{5}(?=\)$)

Or this one without:

\(([a-zA-Z0-9]{5})\)$
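The two variants differ in where the code ends up: with lookarounds the code is the whole match, while without them it is the first capture group (in Stata terms, regexs(0) versus regexs(1)). A small sketch, again in Python's re and not in Stata, so it says nothing about whether Stata's regexm accepts lookarounds or the {5} quantifier; the function names are made up for the example.

```python
import re

# Lookaround variant: the parentheses are asserted but not consumed.
lookaround = re.compile(r"(?<=\()[a-zA-Z0-9]{5}(?=\)$)")
# Plain variant: the parentheses are consumed, the code is group 1.
plain = re.compile(r"\(([a-zA-Z0-9]{5})\)$")


def code_lookaround(text):
    """Whole match is the code (hypothetical helper)."""
    m = lookaround.search(text)
    return m.group(0) if m else None


def code_plain(text):
    """First capture group is the code (hypothetical helper)."""
    m = plain.search(text)
    return m.group(1) if m else None


sample = "practice 4 name but longer (north) (RS236)"
print(code_lookaround(sample))
print(code_plain(sample))
```

Both variants return the same code on the sample row, so the choice mostly comes down to which syntax the regex engine supports.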

Upvotes: 1

revo

Reputation: 48711

You don't need to capture the parentheses:

([a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9])(?=\)$)

I removed the beginning pattern (\()+ and added (?=\)$) to the end, which is a lookahead requiring a literal ) that comes at the end of the line.
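One consequence of dropping the opening parenthesis is worth checking: the pattern then accepts any five alphanumerics that sit right before the final ")", even if the parenthesised text is longer than five characters. The sketch below compares both forms in Python's re (not Stata); the row with a seven-character code is a made-up input, not from the question's data, and first_group is a hypothetical helper.

```python
import re

ALNUM5 = "[a-zA-Z0-9]" * 5

# Pattern from this answer: lookahead only, no opening parenthesis.
no_open_paren = re.compile("(" + ALNUM5 + r")(?=\)$)")
# For comparison: the opening parenthesis is required as well.
with_open_paren = re.compile(r"\((" + ALNUM5 + r")\)$")


def first_group(pattern, text):
    """Return the first capture group (hypothetical helper), or None."""
    m = pattern.search(text)
    return m.group(1) if m else None


real_row = "practice 4 name but longer (north) (RS236)"
odd_row = "hypothetical name (ABCDEFG)"  # invented, code too long

print(first_group(no_open_paren, real_row))
print(first_group(no_open_paren, odd_row))
print(first_group(with_open_paren, odd_row))
```

On the question's data both forms agree; they only differ if some rows carry malformed codes, in which case requiring the opening parenthesis leaves those rows empty instead of returning a partial code.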

Upvotes: 1
