Reputation: 185

How to remove everything but certain words in string variable (Stata)?

I have a string variable response, which contains text as well as categories that have already been coded (categories like "CatPlease", "CatThanks", "ExcuseMe", "Apology", "Mit", etc.). I would like to erase everything in response except for these previously coded categories.

For example, I would like response to change from:

"I Mit understand CatPlease read it again CatThanks"

to:

"Mit CatPlease CatThanks"

This seems like a simple problem, but I can't get my regex code to work perfectly. The code below attempts to store the categories in a variable cat_only. It only works if the category appears at the beginning of response. The local macro, cats, contains all of the words I would like to preserve in response:

local cats = "(CatPlease|CatThanks|ExcuseMe|Apology|Mit|IThink|DK|Confused|Offers|CatYG)?"

gen cat_only = strltrim(strtrim(ustrregexs(1)+" "+ustrregexs(2)+" "+ustrregexs(3))) if ustrregexm(response, "`cats'.+?`cats'.+?`cats'")

If I add characters to the beginning of the search pattern in ustrregexm, however, nothing will be stored in cat_only:

gen cat_only = strltrim(strtrim(ustrregexs(1)+" "+ustrregexs(2)+" "+ustrregexs(3))) if ustrregexm(response, ".+?`cats'.+?`cats'.+?`cats'")

Is there a way to fix my code to make it work, or should I approach the problem differently?

Upvotes: 1

Answers (3)

Bohemian

Reputation: 425448

Spaces can be handled using regex:

local words = "(?!CatPlease|CatThanks|ExcuseMe|Apology|Mit|IThink|DK|Confused|Offers|CatYG)\b\S+\b"
gen wanted = ustrregexra(response, "`words' | ?`words'", "")

This uses an alternation (the regex OR operator, written |) to remove each unwanted word together with its trailing space or, failing that, its leading space. The leading space is optional so that a response consisting of a single unwanted word is also matched and removed.
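One edge case worth checking: as written, the negative lookahead also protects any word that merely begins with a category name (for example a hypothetical "Mitigate", which starts with "Mit"). A minimal sketch of a stricter variant follows, assuming the categories only ever appear as whole words. The extra test rows and the local names keep and unwanted are illustrative, not from the question, and the leftover spaces are cleaned up with stritrim() and strtrim() rather than in the regex.

```stata
clear
input str60 response
"I Mit understand CatPlease read it again CatThanks"
"Mitigate DK nothing"
"Apology"
end

* Categories to preserve, as a regex alternation
local keep "CatPlease|CatThanks|ExcuseMe|Apology|Mit|IThink|DK|Confused|Offers|CatYG"

* Match any whole word that is NOT exactly one of the categories:
* the \b inside the lookahead forces a full-word comparison
local unwanted "\b(?!(?:`keep')\b)\w+\b"

* Remove the unwanted words, then collapse and trim the leftover spaces
gen wanted = strtrim(stritrim(ustrregexra(response, "`unwanted'", "")))
list
```

Because \w+ matches only word characters, any punctuation in response would be left behind and may need a separate cleanup step.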

Upvotes: 1

Wouter

Reputation: 3271

* Example generated by -dataex-. To install: ssc install dataex
clear
input str50 response
"I Mit understand CatPlease read it again CatThanks"
end

local regex "(?!CatPlease|CatThanks|ExcuseMe|Apology|Mit|IThink|DK|Confused|Offers|CatYG)\b[^\s]+\b"
gen wanted = strtrim(stritrim(ustrregexra(response, "`regex'", "")))
list

. list

     +-------------------------------------------------------------------------------+
     |                                           response                     wanted |
     |-------------------------------------------------------------------------------|
  1. | I Mit understand CatPlease read it again CatThanks    Mit CatPlease CatThanks |
     +-------------------------------------------------------------------------------+

Upvotes: 2

Nick Cox

Reputation: 37368

I don't regard myself as fluent with Stata's regex functions, but this may be helpful:

. clear 

. set obs 1 
number of observations (_N) was 0, now 1

. gen test = "I Mit understand CatPlease read it again CatThanks"

. local OK "(CatPlease|CatThanks|ExcuseMe|Apology|Mit|IThink|DK|Confused|Offers|CatYG)"

. ssc install moss
. moss test, match("`OK'") regex 

. egen wanted = concat(_match*), p(" ")

. l wanted

     +-------------------------+
     |                  wanted |
     |-------------------------|
  1. | Mit CatPlease CatThanks |
     +-------------------------+
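If installing moss is not an option, a similar result can be had with built-in functions only. The following is a rough sketch, assuming the categories are separated from the surrounding text by spaces; it walks through each word position with word() and keeps the word only when it is exactly one of the categories. The names keep and nwords are illustrative.

```stata
clear
input str60 response
"I Mit understand CatPlease read it again CatThanks"
end

* Categories to preserve, as a space-separated list
local keep CatPlease CatThanks ExcuseMe Apology Mit IThink DK Confused Offers CatYG

gen wanted = ""

* Find the largest number of words in any response
gen long nwords = wordcount(response)
summarize nwords, meanonly

* Visit each word position in order; append the word if it is a category
forvalues i = 1/`r(max)' {
    foreach k of local keep {
        replace wanted = wanted + " " + "`k'" if word(response, `i') == "`k'"
    }
}

replace wanted = strtrim(wanted)
drop nwords
list wanted
```

This is slower than a single regex replacement on large datasets, but it preserves the original order of the categories and involves no regex at all.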

Upvotes: 2
