Temitope
Temitope

Reputation: 3

How to overwrite a duplicate observation

I conducted a phone survey and here is the prototype of my dataset:

var1    var2

6666    1
6666    2
7676    2
7676    1
8876    1
8876    2
89898   1
89898   2
9999    1
9999    2
5656    1
5656    2
2323    1
2323    2
9876    1
7654    1

var1 is the unique identifier for each case in my survey (in this case, phone numbers).

var2 is the outcome of the survey: 1 (successful), 2 (not successful).

I want keep the observations for each var1 whose var2 == 1, yet retaining the observations for each var1 whosevar2 == 2 if there is no another case where var2 == 1.

I have tried

duplicates drop var1 if var2 == 2, force

but I am not getting the desired output

Upvotes: 0

Views: 74

Answers (1)

Nick Cox
Nick Cox

Reputation: 37208

The question is wrongly titled: you don't want to overwrite anything.

Your syntax doesn't work as you wish because it is not what you want. You are asking whether there are duplicates of var1 if var2 == 2 and that command pays no attention whatsoever to observations for which var2 == 1.

Your example includes no observations for which var2 == 2 but there is no corresponding observation with var2 == 1. I have added one such.

Here's one way of meeting your goal. I show in passing that the duplicates command you have does nothing for this example; nor would it be expected to do anything.

. clear 

. input  var1    var2

          var1       var2
  1.     6666        1
  2.     6666        2
  3.     7676        2
  4.     7676        1
  5.     8876        1
  6.     8876        2
  7.     89898       1
  8.     89898       2
  9.     9999        1
 10.     9999        2
 11.     5656        1
 12.     5656        2
 13.     2323        1
 14.     2323        2
 15.     9876        1
 16.     7654        1
 17.     42      2 
 18. end 

. duplicates list var1 if var2 == 2 

Duplicates in terms of var1

(0 observations are duplicates)

. bysort var1 (var2) : assert _N == 1 | _N == 2 

. by var1 : drop if _n == 2 & var2[2] == 2 
(7 observations deleted)

. list, sepby(var1) 

     +--------------+
     |  var1   var2 |
     |--------------|
  1. |    42      2 |
     |--------------|
  2. |  2323      1 |
     |--------------|
  3. |  5656      1 |
     |--------------|
  4. |  6666      1 |
     |--------------|
  5. |  7654      1 |
     |--------------|
  6. |  7676      1 |
     |--------------|
  7. |  8876      1 |
     |--------------|
  8. |  9876      1 |
     |--------------|
  9. |  9999      1 |
     |--------------|
 10. | 89898      1 |
     +--------------+

Another way to do it would be

. bysort var1 (var2) : keep if _n == 1 & var2[2] == 2 

In fact

. bysort var1 (var2): keep if _n == 1 

keeps observations with var2 == 1 if there are any and otherwise will also keep singletons with var2 == 2.

The hidden assumptions seem to include at most two observations for each distinct var1. Note the use of assert for checking assumptions about the dataset.

Upvotes: 1

Related Questions