Reputation: 1559
I have a dataset that tells me how many referrals each General Practitioner (GP) makes to each hospital.
If at least 1 GP in the data refers patients to two (or more) different hospitals, then I want to run some additional code, otherwise I don't.
I am using this code:
set more off
gsort GP -referrals
by code: gen nvals = _n ==1
generate obs = _N
if nvals != obs {
display "different number of unique observations as total observations-therefore I will run additional code here"
continue
}
display "same number of unique observations as total observations-therefore for this loop I don't wish to run additional code"
At the moment this doesn't seem to be working.
Could someone help me develop this code please? That is, if the total number of observations equals the number of unique observations, I want to skip the next section of code, which would go where I currently have:
display "different number of unique observations as total observations-therefore I will run additional code here"
Upvotes: 0
Views: 745
Reputation: 37278
There are two related problems here. I'll separate them out:
1. The command if (as distinct from the if qualifier)
This kind of construct
if nvals != obs {
...
}
can be a major source of Stata bugs, usually biting those who are accustomed to that being interpreted in a particular way in other software.
If the two items being compared are scalars or macros, then all is usually well. (If the two items can't be compared, then Stata will complain, but that is not the issue here and is only briefly puzzling.)
The problem may arise if the two items are variables, as is the case in both your question and your answer. Stata does not treat this construct as an implicit loop in which the decision is made separately for each observation. Instead, Stata always interprets it as (in this case)
if nvals[1] != obs[1] {
...
}
so that, in general, Stata looks at the values of the variable in the first observation, and only that observation. If the variable is in fact constant across observations, all will be well; otherwise the code will run as legal, but may well give answers that are at least puzzling and often wrong.
This pitfall is an FAQ as can be seen here.
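A minimal sketch of the pitfall, using made-up variables x and y that are not part of the question's data. It assumes an empty Stata session and only illustrates the first-observation behaviour described above:

```stata
clear
set obs 3
generate x = _n          // x is 1, 2, 3
generate y = 1           // y is 1 in every observation

* The if command compares x[1] with y[1] only. Both are 1,
* so this block is skipped even though x and y differ later on.
if x != y {
    display "x and y differ"
}

* To test across all observations, count the disagreements first
* and branch on the saved result r(N), which is a single number.
count if x != y
if r(N) > 0 {
    display "x and y differ in " r(N) " observations"
}
```

Branching on r(N) rather than on the variables themselves makes the decision depend on the whole dataset, not just on the first observation.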
2. Distinct values
A different problem is that your code in the question will not produce the number of distinct (you say "unique") values in any but extreme circumstances. You don't provide a reproducible example, but in any dataset whatever with a variable code, the segment
by code: gen nvals = _n ==1
generate obs = _N
will produce one variable nvals with values 1 or 0 and another variable obs containing the number of observations. The two will be equal if, and only if, there is only one observation in the entire dataset, and the calculation says nothing about distinct values in any other situation. Although you presumably realised that, people interested in the thread should be interested in the logic.
The code
by code: gen nvals = _n ==1
generate firststep = sum(nvals)
egen unique = max(firststep)
does count the number of distinct values of code. As that is a scalar, a simpler way to do it might be
by code: gen nvals = _n ==1
count if nvals == 1
scalar unique = r(N)
without needing to create two extra variables firststep and unique. The variable obs is also redundant, as the if statement could just be
if unique != _N
For a review of this question, including comments on the term "unique", see this paper. If you are interested in the code, search distinct in Stata to find the most up-to-date version.
1. and 2. together
It should now be clear that the if command in your own answer will work as desired because the variables in question are constant by construction.
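Putting the two points together, here is a hedged sketch of how the check in the question might look. It assumes the data hold one observation per GP and hospital pair and that the identifier variable is named GP, as in the question's gsort line; the variable name first is made up for illustration. It uses bysort because the by prefix requires the data to be sorted by the grouping variable:

```stata
* Assumption: one observation per GP-hospital pair, identifier variable GP
bysort GP: generate byte first = _n == 1   // flag one observation per GP
count if first                             // r(N) = number of distinct GPs

if r(N) != _N {
    display "at least one GP refers to more than one hospital"
    * additional code would go here
}
else {
    display "every GP appears once, so the additional code is skipped"
}
```

Because r(N) and _N are both single numbers, the if command here does not depend on which observation happens to come first.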
Upvotes: 1
Reputation: 4011
A simple solution is isid, combined with capture. For example, the auto dataset is uniquely identified by the make variable, but we can generate a non-unique manufacturer variable to illustrate the idea:
sysuse auto , clear
gen manufacturer = word(make, 1)
capture isid manufacturer
if _rc != 0 di "observations by manufacturer are not unique"
else if _rc == 0 di "observations by manufacturer are unique"
capture isid make
if _rc != 0 di "observations by make are not unique"
else if _rc == 0 di "observations by make are unique"
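Applied to the question's data, the same idea might look like the sketch below. It assumes the identifier variable is named GP and that there is one observation per GP and hospital pair. Since capture suppresses every error, including a mistyped variable name, the sketch first checks that GP exists:

```stata
* Assumption: identifier variable GP, one observation per GP-hospital pair
confirm variable GP          // stops with an error if GP does not exist

capture isid GP              // nonzero _rc if GP does not uniquely identify observations
if _rc != 0 {
    display "some GP appears more than once: run the additional code"
    * additional code would go here
}
else {
    display "each GP appears once: skip the additional code"
}
```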
Upvotes: 1
Reputation: 1559
I have found the solution with some more playing.
set more off
gsort GP -referrals
by code: gen nvals = _n ==1
generate firststep = sum(nvals)
egen unique = max(firststep)
generate obs = _N
if unique != obs {
display "different number of unique observations as total observations-therefore I will run additional code here"
continue
}
display "same number of unique observations as total observations-therefore for this loop I don't wish to run additional code"
Seems to be working
Upvotes: 0