b_surial

Reputation: 562

Extract string pattern in new variable based on other string variable

Consider the following variable:

clear

input str18 string
"abc bcd cde"        
"def efg fgh"
"ghi hij ijk"    
end

I can use the regexm() function to flag the observations that contain abc, cde or def:

generate new = regexm(string, "abc|cde|def")

list

|string          new |
|--------------------|
|  abc bcd cde     1 |
|  def efg fgh     1 |
|  ghi hij ijk     0 |

How can I get the following?

|string            wanted  |
|--------------------------|
|  abc bcd cde     abc cde |
|  def efg fgh     def     |
|  ghi hij ijk             |

This question is an extension of the one answered here:

Upvotes: 1

Views: 283

Answers (2)

Nick Cox

Reputation: 37208

I read this as your

  1. Having a list of allowed words.

  2. Wanting the words in a string that occur among the allowed words.

It is fashionable to seek a fancy regular expression solution for such problems, but your example at least yields to a plain loop over the words that exist. Be aware, however, that inlist() has documented limits on the number of arguments it accepts, and the limit is lower for string arguments than for numeric ones.

clear

input str18 string
"abc bcd cde"        
"def efg fgh"
"ghi hij ijk"    
end

generate wanted = "" 

generate wc = wordcount(string) 
summarize wc, meanonly 

quietly forvalues j = 1/`r(max)' { 
    replace wanted = wanted + " " + word(string, `j') if inlist(word(string, `j'), "abc", "cde", "def")
} 

replace wanted = trim(wanted) 

list 

     +----------------------------+
     |      string    wanted   wc |
     |----------------------------|
  1. | abc bcd cde   abc cde    3 |
  2. | def efg fgh       def    3 |
  3. | ghi hij ijk              3 |
     +----------------------------+
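If the list of allowed words grows past what inlist() accepts, one possible workaround is to hold the words in a local macro and compare each word of the string against them in turn. The following is only a sketch under that assumption: the macro name `allowed` and its contents are illustrative and not part of the original answer.

```stata
clear

input str18 string
"abc bcd cde"
"def efg fgh"
"ghi hij ijk"
end

* Hypothetical list of allowed words; extend as needed
local allowed abc cde def

generate wanted = ""
generate wc = wordcount(string)
summarize wc, meanonly

* The outer loop runs over word positions, so the original word order is kept;
* the inner loop compares the j-th word with each allowed word in turn
quietly forvalues j = 1/`r(max)' {
    foreach w of local allowed {
        replace wanted = wanted + " " + "`w'" if word(string, `j') == "`w'"
    }
}

replace wanted = strtrim(wanted)

list string wanted
```

The nested loop trades a little speed for having no ceiling on the number of allowed words, which should matter only for very long lists or very large datasets.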

Upvotes: 2

user8682794

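Here is a solution using a regular expression. It replaces every word that contains none of the allowed words with a space and then trims the leftover blanks: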

clear

input str18 string
"abc bcd cde"        
"def efg fgh"
"ghi hij ijk"    
end

generate wanted = ustrregexra(string, "(\b((?!(abc|cde|def))\w)+\b)", " ")  
replace wanted = strtrim(stritrim(wanted))

list

     +-----------------------+
     |      string    wanted |
     |-----------------------|
  1. | abc bcd cde   abc cde |
  2. | def efg fgh       def |
  3. | ghi hij ijk           |
     +-----------------------+
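The pattern above leaves in place any word that merely contains an allowed word, such as abcd or xcde, because the lookahead is tested at every character. If only exact whole-word matches should survive, a variant that anchors the lookahead with \b may be closer to what is wanted. This is a sketch based on that reading of the question, and the fourth observation is a hypothetical addition used only to show the difference.

```stata
clear

input str18 string
"abc bcd cde"
"def efg fgh"
"ghi hij ijk"
"abcd xcde def"
end

* The lookahead (?!(abc|cde|def)\b) rejects a word only when it is exactly
* an allowed word; every other word is matched by \w+ and replaced by a space
generate wanted = ustrregexra(string, "\b(?!(abc|cde|def)\b)\w+\b", " ")

* Collapse the runs of blanks left behind and trim both ends
replace wanted = strtrim(stritrim(wanted))

list
```

On the three original observations this variant should give the same result as the pattern in the answer; the two differ only for words such as abcd that contain an allowed word without being equal to it.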

Upvotes: 1
