Karin
Karin

Reputation: 11

Keep only observations with data for all variables

I have a dataset with several variables. I want to create a subsample which only includes the observations which have data for all variables, so no missing data in any of the variable.

I know about the dropmiss command in Stata but that does not apply here because I do not want to drop variables, but I want to drop the observations.

I found a question similar to mine in Stack Overflow, but the statistical program used there is SAS and I am using Stata. (SAS - Keeping only observations with all variables).

An example (the "." is a missing data):

ID year pension age gender
1  2006 300    54   F
2  2007 250    40   M
3  2006  .     45   M
4  2005  .      .   F

So in this case I only want to keep ID 1 and 2, and drop 3 and 4 from the sample since it contains missing data for some of the variables.

Upvotes: 1

Views: 4494

Answers (2)

Nick Cox
Nick Cox

Reputation: 37288

The statement about dropmiss (download from Stata Journal website after search dropmiss) is incorrect.

dropmiss has an obs option geared to this need.

. sysuse auto, clear
(1978 Automobile Data)

. dropmiss, obs
(0 observations deleted)

. dropmiss, obs any
(5 observations deleted)

However, dropmiss is considered by its author (that's me) to be superseded by missings (download similarly from Stata Journal website). missings doesn't support this directly, as considering whether missing values can be handled by multiple imputation is widely considered better statistical practice.

But if you insist, missings can help with this too:

. sysuse auto, clear
(1978 Automobile Data)

. missings tag, gen(anymiss)

Checking missings in all variables:
5 observations with missing values

. drop if anymiss
(5 observations deleted)

There's an egen function rowmiss() that behaves similarly.

What's key here is that you don't need to spell out the variable names concerned. However, watch out: these commands can be highly destructive.

Upvotes: 2

ℕʘʘḆḽḘ
ℕʘʘḆḽḘ

Reputation: 19405

The answer is pretty simple assuming you have a limited number of variables. Just type:

keep if !missing(var1) & !missing(var2) & !missing(var3)

That command will only keep rows that have non-missing values of all of the three variables mentioned above. Feel free to add more.

Upvotes: 0

Related Questions