Reputation: 1549
I am using Stata, and have a variable called "practice" which holds a list of practice names, each followed by its 5-character code inside parentheses.
I want to extract only the code into a new variable. Here is an example of what the data in the variable "practice" looks like:
practice 1 name (JRX76)
practice 2 name but longer (XN6S1)
practice 3 name (4NB87)
practice 4 name but longer (north) (RS236)
practice 5 name (WSZ92)
I have used the following code so far:
gen code=regexs(2) if regexm(practice, "(\()+([a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9])")
This works perfectly, except on data in the format of "practice 4" above, for which it extracts "north" rather than "RS236".
I have tried playing around with the $ symbol, but with no success.
I have also not worked out how to combine 'if' statements with regexs, along the lines of the logic "if you find 2 '(', take the 5 character expression after the second '('".
Would anyone be able to point me in the right direction on this please?
Upvotes: 0
Views: 1714
Reputation: 24812
I'd guess you forgot to take the trailing parenthesis into account when you tried to add the end-of-string $ symbol. To keep it as close to your current regex as possible, I would suggest this one:
(\()+([a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9])(\))+$
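As a minimal sketch, this is how the pattern above might be dropped into the original gen command. The two toy rows are copied from the question and the str60 width is just an assumption for the example; it also assumes the values have no trailing spaces, since $ only matches at the very end of the string.

```stata
* Toy data copied from the question (str60 is an arbitrary width)
clear
input str60 practice
"practice 1 name (JRX76)"
"practice 4 name but longer (north) (RS236)"
end

* Same command as in the question, with the closing parenthesis and
* the end-of-string anchor added so that "(north)" can no longer match
gen code = regexs(2) if regexm(practice, "(\()+([a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9])(\))+$")

list practice code
```

Because the match now has to end at the end of the string, the "practice 4" row should yield RS236 instead of north.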
Now there are a few improvements I would suggest:
- you don't need the + quantifier around the parentheses if they occur exactly once
- if Stata supports lookarounds, they could simplify your code
So you could try this one with lookarounds:
(?<=\()[a-zA-Z0-9]{5}(?=\)$)
Or this one without :
\(([a-zA-Z0-9]{5})\)$
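A hedged sketch of the shorter pattern in Stata: as far as I know, older versions of regexm() may not accept the {5} repetition count, so this assumes Stata 14 or newer and uses the Unicode regex functions ustrregexm() and ustrregexs() instead. The variable name code2 and the toy data are only for illustration.

```stata
* Assumes Stata 14+, where ustrregexm()/ustrregexs() are available
clear
input str60 practice
"practice 3 name (4NB87)"
"practice 4 name but longer (north) (RS236)"
end

* Group 1 is the 5-character code; \( and \) match the literal
* parentheses and $ pins the match to the end of the string
gen code2 = ustrregexs(1) if ustrregexm(practice, "\(([a-zA-Z0-9]{5})\)$")

list practice code2
```

Only one capture group is left, so the code is read with ustrregexs(1) rather than regexs(2).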
Upvotes: 1
Reputation: 48711
You don't need to capture the parentheses:
([a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9])(?=\)$)
I removed the beginning pattern (\()+ and added (?=\)$) to the end, which looks ahead for a literal ) that comes at the end of the line.
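A rough sketch of the lookahead variant in Stata, again assuming Stata 14 or newer with ustrregexm(), since the older regexm() may not understand (?=...). The character class is shortened with {5} here, and code3 is a made-up variable name.

```stata
* Assumes Stata 14+; lookaheads need the Unicode regex functions
clear
input str60 practice
"practice 5 name (WSZ92)"
"practice 4 name but longer (north) (RS236)"
end

* The lookahead checks for ")" at the end of the string without
* consuming it, so the whole match (subexpression 0) is the code
gen code3 = ustrregexs(0) if ustrregexm(practice, "[a-zA-Z0-9]{5}(?=\)$)")

list practice code3
```

One design caveat: with no opening parenthesis in the pattern, a longer final code such as (ABCDEFG) would still match on its last five characters, so keeping \( in front is the stricter choice.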
Upvotes: 1