How do I remove a row based on a specific vector within a data.frame?

Question

I have an online survey dataset in which there are multiple complete attempts by participants and I need to selectively remove several of the cases by row number. The data is stored as a data.frame. I realize I could do this manually, but I want to keep this as a script so that I can use it later if need be or someone can duplicate what I've done quickly and efficiently.

What I have tried: I have searched in multiple locations, but my question seems too simple. I have looked at removing rows based on incomplete cases ('complete.cases' and 'na.omit'), but this is not specifically want I want as I am trying to remove a row based on a specific vector within the data.frame

The data:

user_id var1 var2 var3
1         NA   13  bob
3       time   37 fred
4     second   NA lisa
5     second   28 lisa

So, in the above data.frame I have multiple attempts by user lisa. I want to keep her last attempt because it is more complete (no NA in var2), but I need to remove the row based on user_id rather than var3.

Spacedman · Accepted Answer

Starting with:

> data
  var1 var2 var3  user
1    1   NA    2   bob
2   34    3    1   bob
3   NA   NA    2   bob
4    1    2    3  lisa
5    1   NA    2  lisa
6    3    4    5   joe
7    6   NA    4 simon

first compute the completeness score by summing the number of non-NA values in var1 to var3:

> data$score = apply(data[,c("var1","var2","var3")],1,function(x){sum(!is.na(x))})
> data
  var1 var2 var3  user score
1    1   NA    2   bob     2
2   34    3    1   bob     3
3   NA   NA    2   bob     1
4    1    2    3  lisa     3
5    1   NA    2  lisa     2
6    3    4    5   joe     3
7    6   NA    4 simon     2

Then find the row with max(score) in each group. There's probably an easier way to do this:

> pick = unlist(tapply(1:7,data$user,
      function(x){x[data[x,"score"]==max(data[x,"score"])]}))
> pick
  bob   joe  lisa simon 
    2     6     4     7 
> data[pick,]
  var1 var2 var3  user score
2   34    3    1   bob     3
6    3    4    5   joe     3
4    1    2    3  lisa     3
7    6   NA    4 simon     2

If someone has two rows with the same score they'll appear twice:

> data[2,'var2']=NA
> data$score = apply(data[,c("var1","var2","var3")],1,function(x){sum(!is.na(x))})

Now if I recompute pick I get bob twice:

> pick = unlist(tapply(1:7,data$user,
    function(x){x[data[x,"score"]==max(data[x,"score"])]}))
> pick
 bob1  bob2   joe  lisa simon 
    1     2     6     4     7

Which can be fixed by just returning the first match in the pick calculation:

> pick = unlist(tapply(1:7,data$user,
  function(x){x[data[x,"score"]==max(data[x,"score"])][1]}))
> pick
  bob   joe  lisa simon 
    1     6     4     7

You didn't say what you wanted doing with duplicates...

Someone will probably have a one-liner posted in a tic...

How do I remove a row based on a specific vector within a data.frame?

Answers (2)

Related Questions