How to generate a dummy variable in Stata based on a sub-string of an existing string variable?

Question

I am looking for a way to create a dummy variable that checks a string variable called text against several given substrings, such as "book", "buy", and "journey".

I want to check whether an observation contains book, buy, or journey. If any of these keywords is found in the string, the dummy variable should be 1, otherwise 0. An example:

                 TEXT
Book your tickets now
Swiss is making your journey easy
Buy your holiday tickets now!
A touch of Austria in your lungs.

The desired outcome should be

dummy variable
       1
       1
       1
       0

I tried it with strpos and also regexm with very limited results.

Regards,

Johi

Wouter · Accepted Answer

Using strpos may be tedious because you have to take capitalization into account, so I would use regular expressions.

* Example generated by -dataex-. To install: ssc install dataex
clear
input str33 text
"Book your tickets now"            
"Swiss is making your journey easy"
"Buy your holiday tickets now!"    
"A touch of Austria in your lungs."
end

generate wanted = regexm(text, "[Bb]ook|[Bb]uy|[Jj]ourney")
list

Result:

. list

     +--------------------------------------------+
     |                              text   wanted |
     |--------------------------------------------|
  1. |             Book your tickets now        1 |
  2. | Swiss is making your journey easy        1 |
  3. |     Buy your holiday tickets now!        1 |
  4. | A touch of Austria in your lungs.        0 |
     +--------------------------------------------+
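As a possible variation, assuming Stata 14 or later, the Unicode function ustrregexm() takes an optional third argument that requests a case-insensitive match, so the character classes for the first letter are not needed. The variable names wanted_nocase and wanted_word are only illustrative. Like the pattern above, the first line matches substrings, so words such as "booking" or "buyer" would also be flagged; the second line is a sketch that restricts the match to whole words with \b.

* Sketch: third argument nonzero = ignore case
generate wanted_nocase = ustrregexm(text, "book|buy|journey", 1)

* Sketch: same keywords, matched as whole words only
generate wanted_word = ustrregexm(text, "\b(book|buy|journey)\b", 1)
list

On the four example observations both variables should agree with wanted, since each keyword appears there as a whole word.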

See also this link for info on regular expressions.
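For comparison, the strpos route mentioned in the question can be made less tedious by lowercasing the text before searching, which sidesteps the capitalization issue. This is a minimal sketch rather than part of the accepted answer; the variable name wanted_strpos and the loop macro k are illustrative.

* Sketch: start at 0, then flag any observation containing one of the keywords
generate byte wanted_strpos = 0
foreach k in book buy journey {
    * strpos() returns 0 when the keyword is not found
    replace wanted_strpos = 1 if strpos(lower(text), "`k'") > 0
}
list

The loop keeps the keyword list in one place, so adding a further keyword only requires extending the foreach list.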
