Reputation: 21
I have two variables in Stata which are string (MAIDCW1
and strboth
).
I would like to check for each observation whether the value of MAIDCW1
exists somewhere in the whole of strboth
. If this value exists in the variable, I would like to set the variable TreatmentFinal
to 1, otherwise to 0.
I tried the following code variants but I always receive the error code "invalid syntax":
**Variant 1:**
gen TreatmentFinal = 0
forvalues i = 1 / _N {
quietly count if strboth == MAIDCW1
if r(N) > 0 {
replace TreatmentFinal = 1 in `i'
} else {
replace TreatmentFinal = 0 in `i'
}
}
**Variant 2:**
gen TreatmentFinal = 0
forvalues i = 1 / _N {
qui gen match = cond(index(strboth, MAIDCW1[`i']) > 0, 1, 0)
replace TreatmentFinal = match in `i'
drop match
}
I also tried to first format strboth
and MAIDCW1
as strings, but this does not change the result and I still receive the error code "invalid syntax".
Addendum:
1.) I use an Excel dataset and randomize a market ID to a treatment (1) & control group (0). So, the variable Treatment is either 1 or 0.
2.) I merge the market ID (e.g., 046914YBB) and Treatment which becomes strboth (e.g., 046914YBB1).
3.) Now I look at a different column which has several entries of the same market ID (for different calender weeks) and add "1" to all of them in order to compare it to strboth, which would indicate whether it belongs to the treatment group (as the market ID entry in strboth also has a "1" at the end) or to the control group (if the entry has a "0" at the end). This variable then is MAIDCW1. Entries with the same marked IDs should always be in the same group (either treatment or control).
4.) Then, for all entries of MAIDCW1 I check whether the value also exists in strboth in order to define for all market IDs of the different calendar weeks whether they belong to the treatment or control group (TreamtentFinal). -> And here the problem of Nick's code appears: The same entries of market IDs for different calendar weeks (entries in MAIDCW1) receive different values in TreatmentFinal (0 or 1) even though they all should have the same value.
Here is the complete code including Nick’s suggestion:
clear
import excel "Example.xlsx", sheet("Example") firstrow
set seed 123
bysort MarktName: gen double rand1=rnormal() if _n==1
egen max1=max(rand1), by(MarktName)
replace rand1=max1
bysort AbteilungName: gen double rand2=rnormal() if _n==1
egen max2=max(rand2), by(AbteilungName)
replace rand2=max2
sort rand1 rand2
// Randomization pattern
gen Randomisierung =
"111000011100001110000111100011110001111000011100001110000111100011"
// Assignment to treatment (1) or control group (0)
gen Treatment = substr(Randomisierung, _n, 1)
drop Randomisierung
//Here the relevant part begins:
gen TreatmentFinal = 0
gen strboth = MAID + Treatment
gen long obsno = _n
egen long group = group(strboth)
su group, meanonly
local G = r(max)
quietly forval g = 1/`G' {
su obsno if group == `g', meanonly
count if strboth[r(min)] == MAIDCW1
if r(N) > 0 replace TreatmentFinal = 1 if group == `g'
}
Upvotes: 2
Views: 334
Reputation: 37183
This might work. I doubt it's very good Stata style, and someone else may think of a better approach, but let's get you closer to something that works. Some people would see this as a problem for merge
.
What we should do is loop over the distinct values of strboth
and NOT repeat comparisons for each observation with the same value.
EDIT: Note the emphasis here. If 42 occurs repeatedly in one variable, then checking whether it occurs in the other variable need only be done once. You should read the help for group()
in help egen
: it maps the distinct values of its argument(s) to integers 1 up. So distinct strings frog newt toad
would be mapped to 1 2 3
.
gen TreatmentFinal = 0
gen long obsno = _n
egen long group = group(strboth)
su group, meanonly
local G = r(max)
quietly forval g = 1/`G' {
su obsno if group == `g', meanonly
count if strboth[r(min)] == MAIDCW1
if r(N) > 0 replace TreatmentFinal = 1 if group == `g'
}
EDIT: Don't be misled by the option name meanonly
. summarize
is used here to get the minimum as a safe way to look up the value of one variable indexed by each distinct value of group
.
Mistakes in your code:
forvalues
won't evaluate _N
on the fly.
Looping over observations doesn't itself imply focus on a particular observation each time around the loop.
The display format of strings is not relevant to this problem.
Other points:
There is no need to replace
0 with 0.
cond(index(strboth, MAIDCW1[`i']) > 0, 1, 0)
is equivalent to
index(strboth, MAIDCW1[`i']) > 0`
Upvotes: 0