Jhonny

Reputation: 618

Stata split string into parts

I have a string variable (col1) that I want to split at the first occurrence of an integer, i.e. generate variables part1 and part2.

               col1                part1    part2
--------------------------------------------------
    AufderScholle12        AufderScholle       12
Kˆnigsbr¸ckerPlatz3   Kˆnigsbr¸ckerPlatz        3
         Hansastr0A             Hansastr       0A
             Flur:3                Flur:        3

I have not yet figured out how to implement this with regular expressions, despite reading various articles on the matter.

Upvotes: 0

Views: 5080

Answers (2)

Roberto Ferrer

Reputation: 11112

The following works for your example data, but notice I had to insert the "non-conventional" characters inside the regex definition because I don't see a way of expressing "all but numbers" using Stata's implementation of regex:

clear
set more off

*----- example data -----

input ///
str30 orig              
"AufderScholle12"       
"K^nigsbr¸ckerPlatz3"   
"Hansastr0A"          
"Flur:3"
end

list

*----- what you want -----

gen p1 = regexs(1) if(regexm(orig, "([\-\^\¸\:a-zA-Z]*)([0-9]?.*)"))
gen p2 = regexs(2) if(regexm(orig, "([\-\^\¸\:a-zA-Z]*)([0-9]?.*)"))

list

The regex experts can take a look at Stata's implementation (a very simple one) here:

http://www.stata.com/support/faqs/data-management/regular-expressions/

to check for a better way.

According to Stata's help regex:

Regular expression syntax is based on Henry Spencer's NFA algorithm, and this is nearly identical to the POSIX.2 standard.

A solution I'm more confident in uses string functions:

clear
set more off

*----- example data -----

input ///
str30 orig              
"AufderScholle12"       
"K^nigsbr¸ckerPlatz3"   
"Hansastr0A"          
"Flur:3"
end

list

*----- what you want -----

forvalues i = 0/9 {
    gen p_`i' = strpos(orig, "`i'")
    replace p_`i' = . if p_`i' == 0
}

egen fpos = rowmin(p_*)

gen p1 = substr(orig, 1, fpos-1)
gen p2 = substr(orig, fpos, .)

drop fpos p_*
list

This just finds the position where the first numeric character occurs and uses that to single out substrings from the original text.
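As a worked illustration of that logic for a single value, the lines below trace "Hansastr0A" by hand. This is only a sketch using literal strings rather than the variable orig; the first digit is the 9th character.

```stata
* position of the first "0" in the string: displays 9
display strpos("Hansastr0A", "0")

* everything before that position: displays Hansastr
display substr("Hansastr0A", 1, 9 - 1)

* from that position to the end of the string: displays 0A
display substr("Hansastr0A", 9, .)
```

The loop above does the same thing per observation, except that it takes the smallest position over all ten digits.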

See help string functions.

Edit

One way of expressing "all but numbers" is [^0-9]*, so the following would give the same results as the original:

gen p3 = regexs(1) if(regexm(orig, "([^0-9]*)([0-9]?.*)"))
gen p4 = regexs(2) if(regexm(orig, "([^0-9]*)([0-9]?.*)"))
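
If your Stata is version 14 or later (an assumption, since the question does not state a version), the Unicode regex functions ustrregexm() and ustrregexs() also accept the shorthand \D for "not a digit". A minimal sketch under that assumption; the variable names u1 and u2 are chosen here for illustration only:

```stata
* Sketch only: requires Stata 14+, where ustrregexm()/ustrregexs() exist
clear
input str30 orig
"AufderScholle12"
"Hansastr0A"
"Flur:3"
end

* (\D*) captures the leading run of non-digits; (.*) captures the rest
gen u1 = ustrregexs(1) if ustrregexm(orig, "^(\D*)(.*)")
gen u2 = ustrregexs(2) if ustrregexm(orig, "^(\D*)(.*)")

list
```

Note that \D follows the Unicode definition of a digit, so it may differ from [^0-9] on text containing non-ASCII digits.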

Upvotes: 3

Nick Cox

Reputation: 37368

This isn't a complete answer, just a footnote to @Roberto Ferrer's useful answer that would not go well as a comment.

Another way to find the position of the first integer, without creating 10 new variables and then firing up egen:

gen posint = . 
quietly forval i = 0/9 { 
    * strpos() returns 0 when the digit is absent, so update only on a hit
    replace posint = min(posint, strpos(orig, "`i'")) if strpos(orig, "`i'") 
} 
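
To show how posint might then be used for the actual split, here is a self-contained sketch. The third data value ("NoDigitsHere") and the choice to keep the whole string in p1 when no digit is found are assumptions made for illustration, not part of the question.

```stata
clear
input str30 orig
"AufderScholle12"
"Hansastr0A"
"NoDigitsHere"
end

gen posint = .
quietly forval i = 0/9 {
    * strpos() returns 0 when the digit is absent, so update only on a hit
    replace posint = min(posint, strpos(orig, "`i'")) if strpos(orig, "`i'") > 0
}

* split at the first digit; with no digit, p1 keeps everything and p2 is empty
gen p1 = cond(missing(posint), orig, substr(orig, 1, posint - 1))
gen p2 = cond(missing(posint), "", substr(orig, posint, .))

list
```

min() ignores missing arguments, which is why posint can start out as missing and still end up holding the smallest position found.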

Upvotes: 1
