Stata Regex for 'standalone' numbers in string

Question

I am trying to remove a specific pattern of numbers from a string using the regexr function in Stata. I want to remove any run of digits that stands on its own, that is, one bounded only by whitespace rather than attached to a letter or another non-whitespace character. For example, if the string contained t370 or 6-test, I would want those to remain. Only numbers that appear by themselves, separated from the rest by spaces, should be removed.

clear 
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end

I would like to end up with:

ID     string
1      7-test
2      67-tty
3      j3782 3hty

I've tried different regex statements to match numbers wrapped in word boundaries, such as regexr(string, "\b[0-9]+\b", ""), as well as manually adding the whitespace, " [0-9]+", which only replaces when the pattern occurs in the middle of the string, not at the start. If it's easier to do this without regular expressions, that's fine; I was just trying to become more familiar with them.

JR96 · Accepted Answer

Following up on the loop suggested in the comments, you could do something like the following:

clear 
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end

gen N_words = wordcount(string) // # words in each string
qui sum N_words 
global max_words = r(max)  // max # words in all strings

split string, gen(part) parse(" ") // split string at space (p.s. space is the default)

gen string2 = ""
forval i = 1/$max_words {
    * add in parts that contain at least one letter
    replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
    replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}

drop part* N_words

where the result would be

. list

     +----------------------------------------+
     | id                 string      string2 |
     |----------------------------------------|
  1. |  1   9884 7-test 58 - 489       7-test |
  2. |  2         67-tty 783 444       67-tty |
  3. |  3             j3782 3hty   j3782 3hty |
     +----------------------------------------+

Note that I have assumed that you want all words that contain at least one letter. You may need to adjust the regexm here for your specific use case.
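
If you would still like a regex-only route, here is a minimal sketch, assuming Stata 14 or later so that the Unicode function ustrregexra() is available. Two things are worth knowing about the original attempt: regexr() replaces only the first match, and a word boundary sits between a digit and a hyphen, so \b[0-9]+\b would also match the 7 in 7-test. The pattern below instead uses lookarounds to require whitespace (or the string edge) on both sides of a token that contains no letters. The variable name string3 is just an illustrative choice.

```stata
* remove every whitespace-delimited token that contains no letters
* (?<!\S) : token is not preceded by a non-space character
* (?!\S)  : token is not followed by a non-space character
gen string3 = ustrregexra(string, "(?<!\S)[^a-zA-Z\s]+(?!\S)", "")

* collapse the leftover runs of blanks and trim both ends
replace string3 = strtrim(stritrim(string3))
```

On the three sample rows this is intended to give the same values as string2 above. Like the loop, it also drops the lone "-" in the first row; if you only want to drop pure digit tokens, replacing [^a-zA-Z\s]+ with [0-9]+ should do that.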
