Yolo_chicken

Reputation: 1381

Stata Regex for 'standalone' numbers in string

I am trying to remove a specific pattern of numbers from a string using the regexr() function in Stata. I want to remove any run of digits that stands alone, that is, one bounded only by whitespace or by the start or end of the string, rather than attached to a letter or another non-whitespace character. For example, if the string contained t370 or 6-test, I would want those to remain. Only the standalone numbers should be removed.

clear 
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end

I would like to end up with:

ID     string
1      7-test
2      67-tty
3      j3782 3hty

I've tried different regex patterns to find numbers wrapped in word boundaries, such as regexr(string, "\b[0-9]+\b", ""), as well as manually adding the whitespace, " [0-9]+", which only replaces the pattern when it occurs in the middle of the string, not at the start. If it's easier to do this without regular expressions, that's fine; I was just trying to become more familiar with them.

Upvotes: 1

Views: 441

Answers (1)

JR96

Reputation: 973

Following up on the loop suggested in the comments, you could do something like the following:

clear 
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end

gen N_words = wordcount(string) // # words in each string
qui sum N_words 
global max_words = r(max)  // max # words in all strings

split string, gen(part) parse(" ") // split string at space (p.s. space is the default)

gen string2 = ""
forval i = 1/$max_words {
    * add in parts that contain at least one letter
    replace string2 = string2 + " " + part`i' if regexm(part`i', "[a-zA-Z]") & !missing(string2)
    replace string2 = part`i' if regexm(part`i', "[a-zA-Z]") & missing(string2)
}

drop part* N_words

where the result would be

. list

     +----------------------------------------+
     | id                 string      string2 |
     |----------------------------------------|
  1. |  1   9884 7-test 58 - 489       7-test |
  2. |  2         67-tty 783 444       67-tty |
  3. |  3             j3782 3hty   j3782 3hty |
     +----------------------------------------+

Note that I have assumed you want to keep every word that contains at least one letter, so the lone "-" in the first string is dropped along with the standalone numbers. You may need to adjust the regexm() pattern for your specific use case.
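If you would rather stay with a single regular expression, here is a minimal sketch of the same idea using the Unicode regex functions. It assumes Stata 14 or later (where ustrregexra() is available), assumes that "letter" means ASCII a-z/A-Z as in the loop above, and the variable name string3 is just a placeholder. The lookarounds (?<!\S) and (?!\S) stand in for "preceded/followed by whitespace or the edge of the string", so a token is removed only when none of its characters is a letter:

```stata
clear
input id str40 string
1 "9884 7-test 58 - 489"
2 "67-tty 783 444"
3 "j3782 3hty"
end

* remove every whitespace-delimited token that contains no ASCII letter
gen string3 = ustrregexra(string, "(?<!\S)[^a-zA-Z\s]+(?!\S)", "")

* collapse the blanks left behind and trim both ends
replace string3 = stritrim(strtrim(string3))

list
```

On the three example rows this should give the same values as string2 above. If you want to keep tokens such as a lone "-", narrowing the character class to [0-9] would restrict the removal to purely numeric tokens.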

Upvotes: 3
