Reputation: 618
I have a string variable (col1) that I want to split at the first occurrence of an integer, i.e. generate variables part1 and part2.
col1                  part1                part2
--------------------------------------------------
AufderScholle12       AufderScholle        12
Kˆnigsbr¸ckerPlatz3   Kˆnigsbr¸ckerPlatz   3
Hansastr0A            Hansastr             0A
Flur:3                Flur:                3
I have not yet figured out how to implement this with regular expressions, despite reading various articles on the subject.
Upvotes: 0
Views: 5080
Reputation: 11112
The following works for your example data, but notice I had to insert the "non-conventional" characters inside the regex definition because I don't see a way of expressing "all but numbers" using Stata's implementation of regex:
clear
set more off
*----- example data -----
input ///
str30 orig
"AufderScholle12"
"K^nigsbr¸ckerPlatz3"
"Hansastr0A"
"Flur:3"
end
list
*----- what you want -----
gen p1 = regexs(1) if(regexm(orig, "([\-\^\¸\:a-zA-Z]*)([0-9]?.*)"))
gen p2 = regexs(2) if(regexm(orig, "([\-\^\¸\:a-zA-Z]*)([0-9]?.*)"))
list
The regex experts can take a look at Stata's implementation (a very simple one) at http://www.stata.com/support/faqs/data-management/regular-expressions/ to check for a better way.
According to Stata's help regex:
"Regular expression syntax is based on Henry Spencer's NFA algorithm, and this is nearly identical to the POSIX.2 standard."
A solution I'm more confident in uses string functions:
clear
set more off
*----- example data -----
input ///
str30 orig
"AufderScholle12"
"K^nigsbr¸ckerPlatz3"
"Hansastr0A"
"Flur:3"
end
list
*----- what you want -----
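* position of each digit 0-9 in orig; 0 (not found) is recoded to missing
forvalues i = 0/9 {
    gen p_`i' = strpos(orig, "`i'")
    replace p_`i' = . if p_`i' == 0
}
* earliest digit position; p_* avoids matching unrelated variables that start with p
egen fpos = rowmin(p_*)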
gen p1 = substr(orig, 1, fpos-1)
gen p2 = substr(orig, fpos, .)
drop fpos p_*
list
This just finds the position where the first numeric character occurs and uses that to single out substrings from the original text.
See help string functions.
One way of expressing "all but numbers" is [^0-9]*, so the following would give the same results as the original:
gen p3 = regexs(1) if(regexm(orig, "([^0-9]*)([0-9]?.*)"))
gen p4 = regexs(2) if(regexm(orig, "([^0-9]*)([0-9]?.*)"))
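If you are on Stata 14 or later, the Unicode-aware functions ustrregexm() and ustrregexs() may be a better fit for strings with accented characters. This is only a sketch under that version assumption; the variable names u1 and u2 are made up for illustration, and I have not checked it against the mis-encoded strings above.

```stata
* assumes Stata 14+ and a string variable named orig
* group 1: everything before the first digit; group 2: the rest
gen u1 = ustrregexs(1) if ustrregexm(orig, "^([^0-9]*)(.*)")
gen u2 = ustrregexs(2) if ustrregexm(orig, "^([^0-9]*)(.*)")
list orig u1 u2
```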
Upvotes: 3
Reputation: 37368
This isn't a complete answer, just a footnote to @Roberto Ferrer's useful answer that would not go well as a comment.
Another way to find the position of the first integer, without creating 10 new variables and then firing up egen:
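gen posint = .
quietly forval i = 0/9 {
    * strpos() returns 0 when the digit is absent; skip those so min() does not pick up 0
    replace posint = min(posint, strpos(orig, "`i'")) if strpos(orig, "`i'") > 0
}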
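For completeness, here is a self-contained sketch of how a first-digit position could be turned into the two parts. The example data are a subset of those in the question plus one made-up string without digits ("NoDigits"). How to treat such strings is my assumption: here the whole string goes to p1 and p2 is left empty.

```stata
clear
input str30 orig
"AufderScholle12"
"Hansastr0A"
"Flur:3"
"NoDigits"
end

* position of the first digit, missing if the string has none
gen posint = .
quietly forval i = 0/9 {
    replace posint = min(posint, strpos(orig, "`i'")) if strpos(orig, "`i'") > 0
}

* split at that position; strings without digits are kept whole in p1 (assumption)
gen p1 = cond(missing(posint), orig, substr(orig, 1, posint - 1))
gen p2 = cond(missing(posint), "", substr(orig, posint, .))
list
```

The guard on strpos() matters because min() ignores missing values but not zeros, so a digit that does not occur would otherwise pull the position down to 0.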
Upvotes: 1