Rodrigo

Reputation: 69

Stata regular expressions

I need to use a regular expression to extract part of a string variable. My data look like this, where a represents alphabetic characters and x and Z represent numeric characters. I want to extract the Z characters before the "-".

   var1
    "aaa xxx xxx ZZZ-ZZZ-a"
    "aaa xx xxx ZZZ-ZZ"

My code looks like this

gen p_id = regexs(1) if regexm(var1, "([0-9][0-9][0-9])[-]*[0-9][0-9][-]*[ a-zA-Z]*$")

This code extracts more than is required. For example, it also extracts a numeric portion from an observation that looks like the one below, which contains no "-" at all. Specifically, it extracts ZZZ.

var1
"aaa ZZZZZ aaa"

I played around with expressions but cannot get the required answer.

Upvotes: 0

Views: 338

Answers (3)

Rodgers

Reputation: 1

Try

gen var2 = regexs(1) if regexm(var1, "([0-9]+)[-]*([0-9]+)[-]*([0-9]+)[-]?([a-z]*)$")

Then change regexs(1) to regexs(2) and regexs(3), along with the variable name, to generate the other numbers before each "-".
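
To illustrate how regexs(n) picks out the n-th parenthesized subexpression, here is a minimal sketch. It uses a simpler pattern than the one above, with the hyphen required between two groups of digits. The sample values and the variable names x, num1 and num2 are made up for illustration.

```stata
clear
input str30 x
"aaa 736 058 123-456-a"
"aaa 11 688 789-01"
end

* One gen per subexpression; regexm() is evaluated again in each command
gen num1 = regexs(1) if regexm(x, "([0-9]+)[-]([0-9]+)")
gen num2 = regexs(2) if regexm(x, "([0-9]+)[-]([0-9]+)")

list
```

With these two rows, num1 should hold the digits before the first hyphen (123 and 789) and num2 the digits after it (456 and 01).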

Upvotes: 0

Nick Cox
Nick Cox

Reputation: 37368

As often seems to happen, deciding in advance that the solution must be based on regular expressions just complicates your code. From your description you need the three characters before the first "-". That would be

 gen p_id = substr(var1, strpos(var1, "-") - 3, 3) 

Test example:

 clear 

 input str21 var1
 "aaa xxx xxx 123-ZZZ-a"
 "aaa xx xxx 567-ZZ"
 end 

 gen p_id = substr(var1, strpos(var1, "-") - 3, 3) 

 list 

     +------------------------------+
     |                  var1   p_id |
     |------------------------------|
  1. | aaa xxx xxx 123-ZZZ-a    123 |
  2. |     aaa xx xxx 567-ZZ    567 |
     +------------------------------+
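
One caveat, sketched under the assumption that some values contain no "-" at all (as in the "aaa ZZZZZ aaa" case from the question): strpos() returns 0 when the hyphen is absent, so the start position becomes -3, and substr() interprets a negative start as counting from the end of the string. Without a guard, such rows would get their last three characters. The helper variable pos and the third test row are additions for illustration.

```stata
clear
input str21 var1
"aaa xxx xxx 123-ZZZ-a"
"aaa xx xxx 567-ZZ"
"aaa 56789 aaa"
end

* Position of the first hyphen; 0 if there is none
gen byte pos = strpos(var1, "-")

* Extract only when at least three characters precede the hyphen
gen p_id = substr(var1, pos - 3, 3) if pos > 3

list var1 p_id
```

Here the third row should be left empty instead of receiving "aaa". Note that this still does not check that the three extracted characters are digits.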

Upvotes: 3

Roberto Ferrer

Reputation: 11112

I think you need to better describe the structure of the values that can be present. But how about:

clear
set more off

input ///
str30 x
"aaa 736 058 123-456-a"
"aaa 11 688 789-01"
"aaa 56789 aaa"
end

// original
gen p_id = regexs(1) ///
    if regexm(x, "([0-9][0-9][0-9])[-]*[0-9][0-9][-]*[ a-zA-Z]*$")

// modified
gen p_id2 = regexs(1) ///
    if regexm(x, "([0-9]*[-][0-9]*)")

list p_id*

?
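
If only the digits before the hyphen are wanted, one possible variation (a sketch, not necessarily what the question intends) is to keep the hyphen in the pattern but outside the parentheses, so that it must be present but is not captured. The variable name p_id3 is hypothetical.

```stata
clear
input str30 x
"aaa 736 058 123-456-a"
"aaa 11 688 789-01"
"aaa 56789 aaa"
end

* Require a literal hyphen, but capture only the digits before it
gen p_id3 = regexs(1) if regexm(x, "([0-9]+)[-]")

list x p_id3
```

For these three rows this should give 123, 789 and an empty string, since the last value has no hyphen and so does not match. Replacing ([0-9]+) with ([0-9][0-9][0-9]) would insist on exactly three captured digits.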

Upvotes: 2
