user23332575
user23332575

Reputation: 21

Stata: How do I check whether the value of one variable exists somewhere in the column of another variable?

I have two variables in Stata which are string (MAIDCW1 and strboth).

I would like to check for each observation whether the value of MAIDCW1 exists somewhere in the whole of strboth. If this value exists in the variable, I would like to set the variable TreatmentFinal to 1, otherwise to 0.

I tried the following code variants but I always receive the error code "invalid syntax":

**Variant 1:**

gen TreatmentFinal = 0
forvalues i = 1 / _N {

    quietly count if strboth == MAIDCW1
    
    if r(N) > 0 {
        replace TreatmentFinal = 1 in `i'
    } else {
        replace TreatmentFinal = 0 in `i'
    }
}


**Variant 2:**

gen TreatmentFinal = 0
forvalues i = 1 / _N {

    qui gen match = cond(index(strboth, MAIDCW1[`i']) > 0, 1, 0)
    
    replace TreatmentFinal = match in `i'
   
    drop match
}

I also tried to first format strboth and MAIDCW1 as strings, but this does not change the result and I still receive the error code "invalid syntax".

Addendum:

1.) I use an Excel dataset and randomize a market ID to a treatment (1) & control group (0). So, the variable Treatment is either 1 or 0.

2.) I merge the market ID (e.g., 046914YBB) and Treatment which becomes strboth (e.g., 046914YBB1).

3.) Now I look at a different column which has several entries of the same market ID (for different calender weeks) and add "1" to all of them in order to compare it to strboth, which would indicate whether it belongs to the treatment group (as the market ID entry in strboth also has a "1" at the end) or to the control group (if the entry has a "0" at the end). This variable then is MAIDCW1. Entries with the same marked IDs should always be in the same group (either treatment or control).

4.) Then, for all entries of MAIDCW1 I check whether the value also exists in strboth in order to define for all market IDs of the different calendar weeks whether they belong to the treatment or control group (TreamtentFinal). -> And here the problem of Nick's code appears: The same entries of market IDs for different calendar weeks (entries in MAIDCW1) receive different values in TreatmentFinal (0 or 1) even though they all should have the same value.

Here is the complete code including Nick’s suggestion:

clear
  import excel "Example.xlsx", sheet("Example") firstrow

set seed 123
bysort MarktName: gen double rand1=rnormal() if _n==1
egen max1=max(rand1), by(MarktName)
replace rand1=max1

bysort AbteilungName: gen double rand2=rnormal() if _n==1
egen max2=max(rand2), by(AbteilungName)
replace rand2=max2

sort rand1 rand2

// Randomization pattern
gen Randomisierung = 

"111000011100001110000111100011110001111000011100001110000111100011"

// Assignment to treatment (1) or control group (0) 
gen Treatment = substr(Randomisierung, _n, 1)

drop Randomisierung

//Here the relevant part begins:

gen TreatmentFinal = 0 
gen strboth = MAID + Treatment
gen long obsno = _n 
egen long group = group(strboth) 
su group, meanonly  
local G = r(max) 

quietly forval g = 1/`G' { 
    su obsno if group == `g', meanonly 
    count if strboth[r(min)] == MAIDCW1 
    if r(N) > 0 replace TreatmentFinal = 1 if group == `g' 
}

Upvotes: 2

Views: 334

Answers (1)

Nick Cox
Nick Cox

Reputation: 37183

This might work. I doubt it's very good Stata style, and someone else may think of a better approach, but let's get you closer to something that works. Some people would see this as a problem for merge.

What we should do is loop over the distinct values of strboth and NOT repeat comparisons for each observation with the same value.

EDIT: Note the emphasis here. If 42 occurs repeatedly in one variable, then checking whether it occurs in the other variable need only be done once. You should read the help for group() in help egen: it maps the distinct values of its argument(s) to integers 1 up. So distinct strings frog newt toad would be mapped to 1 2 3.

gen TreatmentFinal = 0 
gen long obsno = _n 
egen long group = group(strboth) 
su group, meanonly  
local G = r(max) 

quietly forval g = 1/`G' { 
    su obsno if group == `g', meanonly 
    count if strboth[r(min)] == MAIDCW1 
    if r(N) > 0 replace TreatmentFinal = 1 if group == `g' 
}

EDIT: Don't be misled by the option name meanonly. summarize is used here to get the minimum as a safe way to look up the value of one variable indexed by each distinct value of group.

Mistakes in your code:

  1. forvalues won't evaluate _N on the fly.

  2. Looping over observations doesn't itself imply focus on a particular observation each time around the loop.

  3. The display format of strings is not relevant to this problem.

Other points:

  1. There is no need to replace 0 with 0.

cond(index(strboth, MAIDCW1[`i']) > 0, 1, 0)

is equivalent to

index(strboth, MAIDCW1[`i']) > 0`

Upvotes: 0

Related Questions