Brandon Brooks
Brandon Brooks

Reputation: 3

How do I remove a row based on a specific vector within a data.frame?

I have an online survey dataset in which there are multiple complete attempts by participants and I need to selectively remove several of the cases by row number. The data is stored as a data.frame. I realize I could do this manually, but I want to keep this as a script so that I can use it later if need be or someone can duplicate what I've done quickly and efficiently.

What I have tried: I have searched in multiple locations, but my question seems too simple. I have looked at removing rows based on incomplete cases ('complete.cases' and 'na.omit'), but this is not specifically want I want as I am trying to remove a row based on a specific vector within the data.frame

The data:

user_id var1 var2 var3
1         NA   13  bob
3       time   37 fred
4     second   NA lisa
5     second   28 lisa

So, in the above data.frame I have multiple attempts by user lisa. I want to keep her last attempt because it is more complete (no NA in var2), but I need to remove the row based on user_id rather than var3.

Upvotes: 0

Views: 868

Answers (2)

setwd("~/Stack Overflow")
MultipleSurveys <- read.table("~/Stack Overflow/ ...
MultipleSurveys.txt", header=T, quote="\"")
SurvDat <- MultipleSurveys[,-ncol(MultipleSurveys)][,-1]
NbNA <- rowSums(is.na(SurvDat)); names(NbNA) <- "NbNA"
AMS <- cbind(MultipleSurveys,NbNA)
minNA <- function(DT){
  NbSurv <- nrow(DT)
  if (NbSurv==1) return(DT)
  else{
    OldRow <- DT[1,]
    for (r in 2:NbSurv){
      NewRow <- DT[r,]
      if (NewRow$NbNA<=OldRow$NbNA) OldRow <- NewRow
    }
    return(OldRow)
  }
}
(SingleSurveys <- by(AMS,AMS$user,minNA))

Upvotes: 0

Spacedman
Spacedman

Reputation: 94172

Starting with:

> data
  var1 var2 var3  user
1    1   NA    2   bob
2   34    3    1   bob
3   NA   NA    2   bob
4    1    2    3  lisa
5    1   NA    2  lisa
6    3    4    5   joe
7    6   NA    4 simon

first compute the completeness score by summing the number of non-NA values in var1 to var3:

> data$score = apply(data[,c("var1","var2","var3")],1,function(x){sum(!is.na(x))})
> data
  var1 var2 var3  user score
1    1   NA    2   bob     2
2   34    3    1   bob     3
3   NA   NA    2   bob     1
4    1    2    3  lisa     3
5    1   NA    2  lisa     2
6    3    4    5   joe     3
7    6   NA    4 simon     2

Then find the row with max(score) in each group. There's probably an easier way to do this:

> pick = unlist(tapply(1:7,data$user,
      function(x){x[data[x,"score"]==max(data[x,"score"])]}))
> pick
  bob   joe  lisa simon 
    2     6     4     7 
> data[pick,]
  var1 var2 var3  user score
2   34    3    1   bob     3
6    3    4    5   joe     3
4    1    2    3  lisa     3
7    6   NA    4 simon     2

If someone has two rows with the same score they'll appear twice:

> data[2,'var2']=NA
> data$score = apply(data[,c("var1","var2","var3")],1,function(x){sum(!is.na(x))})

Now if I recompute pick I get bob twice:

> pick = unlist(tapply(1:7,data$user,
    function(x){x[data[x,"score"]==max(data[x,"score"])]}))
> pick
 bob1  bob2   joe  lisa simon 
    1     2     6     4     7 

Which can be fixed by just returning the first match in the pick calculation:

> pick = unlist(tapply(1:7,data$user,
  function(x){x[data[x,"score"]==max(data[x,"score"])][1]}))
> pick
  bob   joe  lisa simon 
    1     6     4     7 

You didn't say what you wanted doing with duplicates...

Someone will probably have a one-liner posted in a tic...

Upvotes: 1

Related Questions