Reputation: 3
I have an online survey dataset in which there are multiple complete attempts by participants and I need to selectively remove several of the cases by row number. The data is stored as a data.frame. I realize I could do this manually, but I want to keep this as a script so that I can use it later if need be or someone can duplicate what I've done quickly and efficiently.
What I have tried: I have searched in multiple locations, but my question seems too simple. I have looked at removing rows based on incomplete cases ('complete.cases' and 'na.omit'), but this is not specifically want I want as I am trying to remove a row based on a specific vector within the data.frame
The data:
user_id var1 var2 var3
1 NA 13 bob
3 time 37 fred
4 second NA lisa
5 second 28 lisa
So, in the above data.frame I have multiple attempts by user lisa. I want to keep her last attempt because it is more complete (no NA in var2
), but I need to remove the row based on user_id
rather than var3
.
Upvotes: 0
Views: 868
Reputation: 181
setwd("~/Stack Overflow")
MultipleSurveys <- read.table("~/Stack Overflow/ ...
MultipleSurveys.txt", header=T, quote="\"")
SurvDat <- MultipleSurveys[,-ncol(MultipleSurveys)][,-1]
NbNA <- rowSums(is.na(SurvDat)); names(NbNA) <- "NbNA"
AMS <- cbind(MultipleSurveys,NbNA)
minNA <- function(DT){
NbSurv <- nrow(DT)
if (NbSurv==1) return(DT)
else{
OldRow <- DT[1,]
for (r in 2:NbSurv){
NewRow <- DT[r,]
if (NewRow$NbNA<=OldRow$NbNA) OldRow <- NewRow
}
return(OldRow)
}
}
(SingleSurveys <- by(AMS,AMS$user,minNA))
Upvotes: 0
Reputation: 94172
Starting with:
> data
var1 var2 var3 user
1 1 NA 2 bob
2 34 3 1 bob
3 NA NA 2 bob
4 1 2 3 lisa
5 1 NA 2 lisa
6 3 4 5 joe
7 6 NA 4 simon
first compute the completeness score by summing the number of non-NA values in var1 to var3:
> data$score = apply(data[,c("var1","var2","var3")],1,function(x){sum(!is.na(x))})
> data
var1 var2 var3 user score
1 1 NA 2 bob 2
2 34 3 1 bob 3
3 NA NA 2 bob 1
4 1 2 3 lisa 3
5 1 NA 2 lisa 2
6 3 4 5 joe 3
7 6 NA 4 simon 2
Then find the row with max(score) in each group. There's probably an easier way to do this:
> pick = unlist(tapply(1:7,data$user,
function(x){x[data[x,"score"]==max(data[x,"score"])]}))
> pick
bob joe lisa simon
2 6 4 7
> data[pick,]
var1 var2 var3 user score
2 34 3 1 bob 3
6 3 4 5 joe 3
4 1 2 3 lisa 3
7 6 NA 4 simon 2
If someone has two rows with the same score they'll appear twice:
> data[2,'var2']=NA
> data$score = apply(data[,c("var1","var2","var3")],1,function(x){sum(!is.na(x))})
Now if I recompute pick I get bob twice:
> pick = unlist(tapply(1:7,data$user,
function(x){x[data[x,"score"]==max(data[x,"score"])]}))
> pick
bob1 bob2 joe lisa simon
1 2 6 4 7
Which can be fixed by just returning the first match in the pick calculation:
> pick = unlist(tapply(1:7,data$user,
function(x){x[data[x,"score"]==max(data[x,"score"])][1]}))
> pick
bob joe lisa simon
1 6 4 7
You didn't say what you wanted doing with duplicates...
Someone will probably have a one-liner posted in a tic...
Upvotes: 1