Reputation: 115
I'm trying to collapse
only a subset of my data using if
, but it seems to be dropping / collapsing much more than I expect.
With every other command with which I have used an if
qualifier, the command applies only to the subset of the data that meets the if
criteria and leaves the rest of the data alone.
For example, replace
does not alter the data for which foreign != 1
:
. sysuse auto, clear
(1978 Automobile Data)
. replace mpg = 16 if foreign == 1
(22 real changes made)
However, it appears that collapse
applies to the data that meets the if
criteria and drops the rest:
. count if mpg > -1
74
. * all the data has mpg > -1
. count if foreign == 1
22
. collapse (mean) mpg if foreign == 1
. count if mpg > -1
1
There is no reason why collapse
could not in theory work the same way as replace
. It could leave all the foreign != 1
intact, while collapsing all foreign == 1
data to one observation.
That is in fact what I want to do with my data, so what should I do differently?
@NickCox helpfully suggested something like this:
. save "temp/whatever"
file temp/whatever.dta saved
. sysuse auto, clear
(1978 Automobile Data)
. drop if foreign == 1
(22 observations deleted)
. append using "temp/whatever"
(note: variable mpg was int, now float to accommodate using data's values)
That works in this sandbox, but my dataset has 10 million observations. If I can avoid having to re-load it, I can save myself a half hour. More if I have to do this for multiple cases.
Any other suggestions would be appreciated.
Upvotes: 0
Views: 7279
Reputation: 37358
collapse
with if
works this way:
Those observations selected by the if
condition are collapse
d, typically (but not necessarily) into a new dataset with fewer observations.
Those observations not selected disappear.
It's incorrect to say that this command is unusual, let alone unique, in that respect. contract
and keep
also work in this way: whatever is not selected disappears.
(The community has often asked for save
with if
: savesome
from SSC is one work-around.)
If you want to collapse
some of the observations but leave the others unchanged, then you can try
A. this strategy
A1. use
your dataset
A2. keep if
what you want unchanged and save
those observations
A3. use
your dataset again
A4. collapse
to taste
A5. append
the dataset from A2
sysuse auto, clear
keep if !foreign
save domestic
sysuse auto, clear
collapse mpg if foreign
gen make = "All foreign"
append using domestic
or B. this one:
B1. work with (if needed create) an identifier that is unique (distinct) for the observations you want unchanged but takes on a single value for the observations you want collapse
d
B2. collapse
feeding that identifier to by()
.
sysuse auto, clear
replace make = "All foreign" if foreign
collapse mpg, by(make)
Although B looks trivial for this example, it is far from obvious to me that it is always superior for large datasets and if you want to continue to work with many variables. I have not experimented with timing or memory comparisons for large datasets or even any datasets, as I haven't encountered this wish before.
Upvotes: 1