Stata split string into parts

Question

I have a string variable (col1) that I want to split at the first occurrence of a digit, generating the variables part1 and part2.

               col1                part1    part2
--------------------------------------------------
    AufderScholle12        AufderScholle       12
Kˆnigsbr¸ckerPlatz3   Kˆnigsbr¸ckerPlatz        3
         Hansastr0A             Hansastr       0A
             Flur:3                Flur:        3

I have not yet figured out how to implement this with regular expressions, despite reading various articles on the matter.

Roberto Ferrer · Accepted Answer

The following works for your example data. Note that I had to list the "non-conventional" characters explicitly inside the regex character class, because I did not see a way of expressing "all but numbers" using Stata's implementation of regex:

clear
set more off

*----- example data -----

input ///
str30 orig              
"AufderScholle12"       
"K^nigsbr¸ckerPlatz3"   
"Hansastr0A"          
"Flur:3"
end

list

*----- what you want -----

gen p1 = regexs(1) if(regexm(orig, "([\-\^\¸\:a-zA-Z]*)([0-9]?.*)"))
gen p2 = regexs(2) if(regexm(orig, "([\-\^\¸\:a-zA-Z]*)([0-9]?.*)"))

list

The regex experts can take a look at Stata's implementation (a very simple one) here:

http://www.stata.com/support/faqs/data-management/regular-expressions/

to check for a better way.

According to Stata's help regex:

Regular expression syntax is based on Henry Spencer's NFA algorithm, and this is nearly identical to the POSIX.2 standard.

A solution I'm more confident in uses string functions:

clear
set more off

*----- example data -----

input ///
str30 orig              
"AufderScholle12"       
"K^nigsbr¸ckerPlatz3"   
"Hansastr0A"          
"Flur:3"
end

list

*----- what you want -----

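* position of the first occurrence of each digit 0-9 (missing if absent)
forvalues i = 0/9 {
    gen p_`i' = strpos(orig, "`i'")
    replace p_`i' = . if p_`i' == 0
}

* position of the first digit of any kind
* (p_* rather than p* so that only the helper variables are matched)
egen fpos = rowmin(p_*)

* split the string just before that position
gen p1 = substr(orig, 1, fpos-1)
gen p2 = substr(orig, fpos, .)

drop fpos p_*
list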

This just finds the position where the first numeric character occurs and uses that to single out substrings from the original text.

See help string functions.
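
The position-finding step can also be sketched without the ten helper variables, by keeping a running minimum in a single variable. This is only a variant of the same idea, not a different method. The names fpos2, q1 and q2 and the "NoDigits" row are made up for illustration, and leaving strings without any digit unsplit is an assumption, since the question does not say what should happen in that case.

```stata
clear
input str30 orig
"AufderScholle12"
"Hansastr0A"
"Flur:3"
"NoDigits"
end

* running minimum of the digit positions; min() ignores missing values
gen fpos2 = .
forvalues i = 0/9 {
    replace fpos2 = min(fpos2, strpos(orig, "`i'")) if strpos(orig, "`i'") > 0
}

* default (assumption): no digit found, so nothing is split off
gen q1 = orig
gen q2 = ""

* split just before the first digit where one exists
replace q1 = substr(orig, 1, fpos2 - 1) if !missing(fpos2)
replace q2 = substr(orig, fpos2, .)     if !missing(fpos2)

drop fpos2
list orig q1 q2
```

The explicit !missing(fpos2) guard keeps the no-digit case visible in the code instead of relying on how substr() treats missing arguments.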

Edit

One way of expressing "all but numbers" is [^0-9]*, so the following would give the same results as the original:

gen p3 = regexs(1) if(regexm(orig, "([^0-9]*)([0-9]?.*)"))
gen p4 = regexs(2) if(regexm(orig, "([^0-9]*)([0-9]?.*)"))
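
A quick sanity check on the negated character class, as a minimal sketch: it uses only the ASCII rows of the example data (an assumption made to sidestep encoding issues) and tests two properties that a correct split should have.

```stata
clear
input str30 orig
"AufderScholle12"
"Hansastr0A"
"Flur:3"
end

* group 1: everything before the first digit; group 2: the rest
gen p3 = regexs(1) if regexm(orig, "([^0-9]*)([0-9]?.*)")
gen p4 = regexs(2) if regexm(orig, "([^0-9]*)([0-9]?.*)")

* the two parts should reassemble the original string
assert p3 + p4 == orig

* the first part should not contain any digit
assert !regexm(p3, "[0-9]")

list
```

If either assert fails, Stata stops with an error, so running this on the real data is a cheap way to confirm the pattern before relying on it.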
