Fuca26
Fuca26

Reputation: 235

Generate dummies from list of strings contained in one variable

I have a string variable that tells me what insurances are accepted by individual doctors.

Example data for the first five doctors in the dataset:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long ID_MHP str376 Insurance
162582 "||Aetna|||Anthem|||BlueCrossandBlueShield|||Cigna|||Humana|||Magellan|||Optum|||UMR|||UnitedHealthcare||"                                                                                                                                 
290515 "||Aetna|||Allegiance|||BlueCross|||BlueShield|||BlueCrossandBlueShield|||Cigna|||FirstChoiceHealth|||Kaiser(Out-of-Network)|||MHNetBehavioralHealth|||Magellan|||Medicaid|||OtherInsuranceEBMS|||PacificSource|||TriWest|||OutofNetwork||"
281539 "||Aetna|||BlueCrossandBlueShield|||CHIP|Children'sProgram|||Cigna|||Medicaid|||Medicare|||UnitedHealthCare|||OutofNetwork||"                                                                                                              
387593 "||TRICARE||"                                                                                                                                                                                                                              
108333 "||OutofNetwork||"                                                                                                                                                                                                                         
end

So, each doctor accepts different brands and quantity of insurances.

I want to create a dummy variable for every insurance and the result should look as follows:

  +-----------------------------+
  | ID_MHP   Aetna   Allegiance |
  |-----------------------------|
  | 162582       1            0 |
  | 290515       1            1 |
  | 281539       1            0 |
  | 387593       0            0 |
  | 108333       0            0 |
  +-----------------------------+

That is, for each insurance brand accepted by these five doctors, there should be a separate dummy variable. Here I report only the first two insurances in alphabetical order.

I believe that a first step should be the creation of the list of all of the possible insurances that are accepted by the doctors in my dataset.

For these five doctors, I should get a list that looks like the one below:

Aetna
Allegiance
Anthem
BlueCross
BlueCrossandBlueShield
BlueShield
Children'sProgram
CHIP
Cigna
FirstChoiceHealth
Humana
Kaiser(Out-of-Network)
Magellan
Medicaid
Medicare
MHNetBehavioralHealth
Optum
OtherInsuranceEBMS
OutofNetwork
PacificSource
TRICARE
TriWest
UMR
UnitedHealthcare

Is there a way to create this list other than by hand?

After I have created this list I can create a dummy for each insurance company:

generate Aetna = .
replace Aetna = 0 if Insurance!="NA"
replace Aetna = 1 if  regexm(Insurance, "(Aetna)")

Can I replace Aetna=0 if "insurance does not contain Aetna" with the function regexm()?

For example:

replace Aetna = 0 if regexm!(Insurance, "(Aetna)")

Cross-posted on Statalist.

Upvotes: 0

Views: 458

Answers (1)

user8682794
user8682794

Reputation:

The following works for me:

preserve

replace Insurance = subinstr(Insurance, "|", " ", .)
split Insurance
drop Insurance

reshape long Insurance, i(ID_MHP) j(_j)
levelsof Insurance, clean

restore

foreach x in `r(levels)' {
    capture generate `x' = strmatch(Insurance, "*`x'*")
    if _rc {
        local modified `modified' `x'
        local varname = substr(ustrregexra("`x'", "[^a-zA-Z]" ,"_"), 1, 32) // max 32 chars
        generate `varname' = strmatch(Insurance, "*`x'*")
        local new `new' `varname'
    }           
}

display "`modified'"
Children'sProgram Kaiser(Out-of-Network)

display "`new'"
Children_sProgram Kaiser_Out_of_Network_

Upvotes: 1

Related Questions