RGecon
RGecon

Reputation: 115

Collapse only a subset of the dataset with "if"

I'm trying to collapse only a subset of my data using if, but it seems to be dropping / collapsing much more than I expect.

With every other command with which I have used an if qualifier, the command applies only to the subset of the data that meets the if criteria and leaves the rest of the data alone.

For example, replace does not alter the data for which foreign != 1:

. sysuse auto, clear
(1978 Automobile Data)

. replace mpg = 16 if foreign == 1
(22 real changes made)

However, it appears that collapse applies to the data that meets the if criteria and drops the rest:

. count if mpg > -1
  74

. * all the data has mpg > -1

. count if foreign == 1
  22

. collapse (mean) mpg if foreign == 1

. count if mpg > -1
  1

There is no reason why collapse could not in theory work the same way as replace. It could leave all the foreign != 1 intact, while collapsing all foreign == 1 data to one observation.

That is in fact what I want to do with my data, so what should I do differently?

@NickCox helpfully suggested something like this:

. save "temp/whatever"
file temp/whatever.dta saved

. sysuse auto, clear
(1978 Automobile Data)

. drop if foreign == 1
(22 observations deleted)

. append using "temp/whatever"
(note: variable mpg was int, now float to accommodate using data's values)

That works in this sandbox, but my dataset has 10 million observations. If I can avoid having to re-load it, I can save myself a half hour. More if I have to do this for multiple cases.

Any other suggestions would be appreciated.

Upvotes: 0

Views: 7279

Answers (1)

Nick Cox
Nick Cox

Reputation: 37358

collapse with if works this way:

Those observations selected by the if condition are collapsed, typically (but not necessarily) into a new dataset with fewer observations.

Those observations not selected disappear.

It's incorrect to say that this command is unusual, let alone unique, in that respect. contract and keep also work in this way: whatever is not selected disappears. (The community has often asked for save with if: savesome from SSC is one work-around.)

If you want to collapse some of the observations but leave the others unchanged, then you can try

A. this strategy

A1. use your dataset

A2. keep if what you want unchanged and save those observations

A3. use your dataset again

A4. collapse to taste

A5. append the dataset from A2

sysuse auto, clear
keep if !foreign 
save domestic 
sysuse auto, clear
collapse mpg if foreign 
gen make = "All foreign" 
append using domestic 

or B. this one:

B1. work with (if needed create) an identifier that is unique (distinct) for the observations you want unchanged but takes on a single value for the observations you want collapsed

B2. collapse feeding that identifier to by().

sysuse auto, clear 
replace make = "All foreign" if foreign 
collapse mpg, by(make) 

Although B looks trivial for this example, it is far from obvious to me that it is always superior for large datasets and if you want to continue to work with many variables. I have not experimented with timing or memory comparisons for large datasets or even any datasets, as I haven't encountered this wish before.

Upvotes: 1

Related Questions