Reputation: 161
I need to create a column in a dataset that reports the most recent row-wise modal text value in a selection of columns (ignoring NAs).
Background: I have a dataset in which up to 4 coders rated participant transcripts (one participant per row). Occasionally a minority of coders either disagree or select the wrong code for a participant. So I need to reproducibly select the modal code across coders for each participant (i.e., for each row) and, when there is a tie, select the most recent (later) modal code, because later codings are more likely to be correct.
Here's a fake example of the dataset with four coders' codes (Essay or Chat) for 3 participants (one per row).
> fakeData = data.frame(id = 1:3,
+ Condition = c("Essay", "Chat", "Chat"),
+ FirstCoder = c("NA","Essay","Essay"),
+ SecondCoder = c("NA","Chat","Essay"),
+ ThirdCoder = c("Essay","Chat","Chat"),
+ FourthCoder = c("Essay","NA","Chat"))
> fakeData
  id Condition FirstCoder SecondCoder ThirdCoder FourthCoder
1  1     Essay         NA          NA      Essay       Essay
2  2      Chat      Essay        Chat       Chat          NA
3  3      Chat      Essay       Essay       Chat        Chat
Regarding recency: The "FirstCoder" coded first, "SecondCoder" coded next, then the "ThirdCoder" submitted their code, and "FourthCoder" was the last (and most recent) coder to submit a response.
Here are some methods I've tried from other forums—notice how I need to ignore the "Condition" column:
> fakeData$ModalCode1 <- apply(fakeData,1,function(x) names(which.max(table(c("FirstCoder","SecondCoder", "ThirdCoder", "FourthCoder")))))
> fakeData$ModalCode2 <- apply(select(fakeData,ends_with("Coder")), 1, Mode)
The correct result would be this column (created manually)
> fakeData$MostRecentModalCode <- c("Essay", "Chat", "Chat")
You can see that none of my attempts are getting the correct result (i.e., "MostRecentModalCode").
> fakeData
  id Condition FirstCoder SecondCoder ThirdCoder FourthCoder ModalCode1 ModalCode2 MostRecentModalCode
1  1     Essay         NA          NA      Essay       Essay FirstCoder         NA               Essay
2  2      Chat      Essay        Chat       Chat          NA FirstCoder       Chat                Chat
3  3      Chat      Essay       Essay       Chat        Chat FirstCoder      Essay                Chat
As you can see the final (correct) column ignores NAs and breaks modal ties with the more recent coders' responses (unlike the traditional Mode function).
Surely there's a function for this, but I am just failing to find or correctly implement it.
Advice and solutions welcome! (If I have to create a custom function, that's fine—albeit surprising.)
Upvotes: 2
Views: 62
Reputation: 101024
If you work with data.table, you can try the code below
library(data.table)
melt(setDT(fakeData),
id.vars = "id", na.rm = TRUE
)[
, .N,
.(id, value)
][
, .(value = value[which.max(N)]),
id
]
which gives
   id value
1:  1 Essay
2:  2  Chat
3:  3  Chat
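Note that with id.vars = "id" alone, melt() also treats Condition as a measure variable, and which.max() returns whichever tied value was grouped first. A possible variation, assuming missing codes are stored as real NA values (not the string "NA"), restricts the melt to the coder columns and breaks ties in favour of the latest coder. The names dt, coder_cols, long, counts, last and res are only illustrative:

```r
library(data.table)

# Assumption: missing codes are real NA values, not the string "NA"
dt <- data.table(
  id          = 1:3,
  Condition   = c("Essay", "Chat", "Chat"),
  FirstCoder  = c(NA, "Essay", "Essay"),
  SecondCoder = c(NA, "Chat", "Essay"),
  ThirdCoder  = c("Essay", "Chat", "Chat"),
  FourthCoder = c("Essay", NA, "Chat")
)
coder_cols <- c("FirstCoder", "SecondCoder", "ThirdCoder", "FourthCoder")

# Melt only the coder columns; na.rm = TRUE drops the real NAs
long <- melt(dt, id.vars = "id", measure.vars = coder_cols, na.rm = TRUE)

# 'variable' is a factor whose levels follow measure.vars, so its integer
# code reflects coding order (1 = earliest coder, 4 = most recent coder)
counts <- long[, .(N = .N, last = max(as.integer(variable))), by = .(id, value)]

# Within each id, sort by count and then by recency, and keep the first value
res <- counts[order(id, -N, -last), .(value = value[1]), by = id]
```

Here row 3 is a two-against-two tie, and sorting on last is what lets the later coders' value win.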
Upvotes: 0
Reputation: 39647
You can use:
apply(fakeData[-1], 1, \(x) names(which(max(table(x))==table(x))))
#[1] "Essay" "Chat" "Chat"
This returns all of the most frequent levels when there is more than one.
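If a single value per row is needed, one option is to break ties in favour of the value given by the latest coder. This is a minimal base R sketch, assuming that missing codes are real NA values (not the string "NA") and that the coder columns are ordered from earliest to latest. latest_mode and coder_cols are hypothetical names:

```r
# Assumption: missing codes are real NA values, not the string "NA"
fakeData <- data.frame(
  id          = 1:3,
  Condition   = c("Essay", "Chat", "Chat"),
  FirstCoder  = c(NA, "Essay", "Essay"),
  SecondCoder = c(NA, "Chat", "Essay"),
  ThirdCoder  = c("Essay", "Chat", "Chat"),
  FourthCoder = c("Essay", NA, "Chat")
)
coder_cols <- c("FirstCoder", "SecondCoder", "ThirdCoder", "FourthCoder")

# Hypothetical helper: most frequent value in x; when several values are
# tied, return the tied value that occurs latest in x
latest_mode <- function(x) {
  x <- unname(x[!is.na(x)])                # drop real NAs
  if (length(x) == 0) return(NA_character_)
  counts <- table(x)
  tied <- names(counts)[counts == max(counts)]
  x[max(which(x %in% tied))]               # last position holding a tied value
}

# Only the coder columns are passed, so Condition is not counted
result <- apply(fakeData[coder_cols], 1, latest_mode)
```

Selecting the coder columns explicitly keeps Condition out of the count, which fakeData[-1] does not.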
Upvotes: 0
Reputation: 1309
What about:
apply(fakeData[,-1], 1, DescTools::Mode, na.rm=TRUE)
?
Upvotes: 0
Reputation: 161
@akrun's answer pointed me to another post that had a custom Mode function buried in the answers that fit my needs. I've renamed it ModeC, adapted from Mode in @DanHoughton's answer (https://stackoverflow.com/a/53290748/1701844).
ModeC <- function(x) {
if ( length(x) <= 2 ) return(x[1])
if ( anyNA(x) ) x = x[!is.na(x)]
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
For reasons I do not understand, it fails to ignore NAs on the fakeData (whether it's a data.table or a data.frame and even when the NAs are not just "NA" strings), but it correctly ignores NAs when determining the mode in my actual data. So I am posting it here in case it works for others.
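Some possible contributing factors, offered as a sketch rather than a diagnosis: is.na() does not flag the string "NA"; the length(x) <= 2 shortcut runs before any NAs are dropped; and which.max() returns the first of several tied values, not the most recent one. A variant that addresses these points might look like the following, where ModeC_recent and coder_cols are hypothetical names and treating the string "NA" as missing is an assumption:

```r
# The question's data, where missing codes are the string "NA"
fakeData <- data.frame(
  id          = 1:3,
  Condition   = c("Essay", "Chat", "Chat"),
  FirstCoder  = c("NA", "Essay", "Essay"),
  SecondCoder = c("NA", "Chat", "Essay"),
  ThirdCoder  = c("Essay", "Chat", "Chat"),
  FourthCoder = c("Essay", "NA", "Chat")
)
coder_cols <- c("FirstCoder", "SecondCoder", "ThirdCoder", "FourthCoder")

# Hypothetical variant of ModeC: ignores missing codes and breaks ties
# in favour of the value that appears latest in x
ModeC_recent <- function(x) {
  # Assumption: both real NA and the literal string "NA" mean "missing"
  x <- x[!is.na(x) & x != "NA"]
  if (length(x) == 0) return(NA_character_)
  x <- rev(unname(x))   # latest coder first, so which.max() favours recent values
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

fakeData$MostRecentModalCode <- apply(fakeData[coder_cols], 1, ModeC_recent)
```

Reversing x before counting means that, among tied values, the one seen most recently is listed first in ux and is therefore the one which.max() selects.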
Upvotes: 1