Reputation: 69
I need to use a regular expression to extract part of a string variable. My data look like this, where a represents alphabetic characters and x and Z represent numeric characters. I want to extract the Z characters before the "-".
var1
"aaa xxx xxx ZZZ-ZZZ-a"
"aaa xx xxx ZZZ-ZZ"
My code looks like this
gen p_id = regexs(1) if regexm(var1, "([0-9][0-9][0-9])[-]*[0-9][0-9][-]*[ a-zA-Z]*$")
This code extracts more than is required. For example, it also extracts a numeric portion (specifically ZZZ) from an observation that looks like this:
var1
"aaa ZZZZZ aaa"
I played around with expressions but cannot get the required answer.
Upvotes: 0
Views: 338
Reputation: 1
Try
gen var2 = regexs(1) if regexm(var1, "([0-9]+)[-]*([0-9]+)[-]*([0-9]+)[-]?([a-z]*)$")
Then change regexs(1) to regexs(2) and regexs(3), along with the variable name, to generate the other numbers before each "-".
Upvotes: 0
Reputation: 37368
As often seems to happen, deciding in advance that the solution must be based on regular expressions just complicates your code. From your description you need the three characters before the first "-". That would be
gen p_id = substr(var1, strpos(var1, "-") - 3, 3)
Test example:
clear
input str21 var1
"aaa xxx xxx 123-ZZZ-a"
"aaa xx xxx 567-ZZ"
end
gen p_id = substr(var1, strpos(var1, "-") - 3, 3)
list
     +------------------------------+
     |                  var1   p_id |
     |------------------------------|
  1. | aaa xxx xxx 123-ZZZ-a    123 |
  2. |     aaa xx xxx 567-ZZ    567 |
     +------------------------------+
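One caveat, sketched here under the assumption that some observations contain no "-" at all (like the "aaa ZZZZZ aaa" case in the question): strpos() then returns 0, the start position becomes -3, and substr() interprets a negative start as counting from the end of the string, so the last three characters would be picked up instead. A guard on the hyphen position leaves such observations empty. The names pos and p_id_safe and the third test value are illustrative, not part of the original example.

```stata
clear
input str21 var1
"aaa xxx xxx 123-ZZZ-a"
"aaa xx xxx 567-ZZ"
"aaa 56789 aaa"
end

* position of the first "-" (0 if there is none)
gen pos = strpos(var1, "-")

* take the three characters before the first "-" only when
* a hyphen exists and at least three characters precede it
gen p_id_safe = substr(var1, pos - 3, 3) if pos > 3

list var1 p_id_safe
```

Observations that fail the condition are left as empty strings, which makes them easy to find with count if p_id_safe == "".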
Upvotes: 3
Reputation: 11112
I think you need to better describe the structure of the values that can be present. But how about:
clear
set more off
input ///
str30 x
"aaa 736 058 123-456-a"
"aaa 11 688 789-01"
"aaa 56789 aaa"
end
// original
gen p_id = regexs(1) ///
if regexm(x, "([0-9][0-9][0-9])[-]*[0-9][0-9][-]*[ a-zA-Z]*$")
// modified
gen p_id2 = regexs(1) ///
if regexm(x, "([0-9]*[-][0-9]*)")
list p_id*
?
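If only the digits before the first "-" are wanted, one possible variation is to keep the hyphen in the pattern but outside the capture group. This is a sketch under the assumption that the identifier is always the run of digits immediately before the first hyphen; p_id3 and p_id4 are illustrative names.

```stata
clear
input str30 x
"aaa 736 058 123-456-a"
"aaa 11 688 789-01"
"aaa 56789 aaa"
end

* one or more digits followed by a literal hyphen;
* only the digits are inside the capture group
gen p_id3 = regexs(1) if regexm(x, "([0-9]+)-")

* stricter variant: exactly three digits immediately before a hyphen
gen p_id4 = regexs(1) if regexm(x, "([0-9][0-9][0-9])-")

list x p_id3 p_id4
```

Because the hyphen is now required rather than optional, the third value, which contains no hyphen, does not match and is left empty.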
Upvotes: 2