Reputation: 161

Most frequent factor across specific columns—with recency breaking ties

I need to create a column in a dataset that reports the most recent row-wise modal text value in a selection of columns (ignoring NAs).

Background: I've a dataset where up to 4 coders rated participant transcripts (one participant/row). Occasionally a minority of coders either disagree or select the wrong code for a participant/row. So I need to reproducibly select the modal code response across coders for each participant (i.e., for each row) and—when there is a tie—select the most recent (later) modal code responses (because later codings are more likely to be correct).

Here's a fake example of the dataset with four coders' codes (Essay or Chat) for 3 participants (one per row).

> fakeData = data.frame(id = 1:3,
+                 Condition = c("Essay", "Chat", "Chat"),
+                 FirstCoder = c("NA","Essay","Essay"),
+                 SecondCoder = c("NA","Chat","Essay"),
+                 ThirdCoder = c("Essay","Chat","Chat"),
+                 FourthCoder = c("Essay","NA","Chat"))
> fakeData
  id Condition FirstCoder SecondCoder ThirdCoder FourthCoder
1  1     Essay         NA          NA      Essay       Essay
2  2      Chat      Essay        Chat       Chat          NA
3  3      Chat      Essay       Essay       Chat        Chat

Regarding recency: The "FirstCoder" coded first, "SecondCoder" coded next, then the "ThirdCoder" submitted their code, and "FourthCoder" was the last (and most recent) coder to submit a response.

Here are some methods I've tried from other forums—notice how I need to ignore the "Condition" column:

> fakeData$ModalCode1 <- apply(fakeData,1,function(x) names(which.max(table(c("FirstCoder","SecondCoder", "ThirdCoder", "FourthCoder")))))
> fakeData$ModalCode2 <- apply(select(fakeData,ends_with("Coder")), 1, Mode)

The correct result would be this column (created manually)

> fakeData$MostRecentModalCode <- c("Essay", "Chat", "Chat")

You can see that neither of my attempts produces the correct result (i.e., "MostRecentModalCode").

> fakeData
  id Condition FirstCoder SecondCoder ThirdCoder FourthCoder ModalCode1 ModalCode2 MostRecentModalCode
1  1     Essay         NA          NA      Essay       Essay FirstCoder         NA               Essay
2  2      Chat      Essay        Chat       Chat          NA FirstCoder       Chat                Chat
3  3      Chat      Essay       Essay       Chat        Chat FirstCoder      Essay                Chat

As you can see, the final (correct) column ignores NAs and breaks modal ties in favor of the more recent coders' responses (unlike the traditional Mode function).

Surely there's a function for this, but I am just failing to find or correctly implement it.

Advice and solutions welcome! (If I have to create a custom function, that's fine—albeit surprising.)

Upvotes: 2

Answers (5)

ThomasIsCoding

Reputation: 101024

If you work with data.table, you can try the code below:

library(data.table)

melt(setDT(fakeData),
  id.vars = "id", na.rm = TRUE
)[
  , .N,
  .(id, value)
][
  , .(value = value[which.max(N)]),
  id
]

which gives

   id value
1:  1 Essay
2:  2  Chat
3:  3  Chat

Upvotes: 0

GKi

Reputation: 39647

You can use:

apply(fakeData[-1], 1, \(x) names(which(max(table(x))==table(x))))
#[1] "Essay" "Chat"  "Chat"

This returns all of the most frequent levels when there is more than one.
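
When several levels tie, one way to keep only the most recent of them is to take the last value in coding order that belongs to the set of modes. The following is a minimal sketch, not a tested drop-in: recent_mode is a hypothetical helper name, and it assumes the coder columns are ordered from first to last coder, end in "Coder", and use the literal string "NA" (as in the question's fakeData) or a real NA for missing codes.

```r
# Hypothetical helper: most frequent value, ties broken by the latest position.
recent_mode <- function(x) {
  x <- x[!is.na(x) & x != "NA"]                  # drop real NAs and "NA" strings
  if (length(x) == 0) return(NA_character_)      # nothing left to count
  counts <- table(x)
  modes <- names(counts)[counts == max(counts)]  # every value tied for the mode
  unname(tail(x[x %in% modes], 1))               # the latest-coded one among them
}

# Data as defined in the question.
fakeData <- data.frame(
  id = 1:3,
  Condition = c("Essay", "Chat", "Chat"),
  FirstCoder = c("NA", "Essay", "Essay"),
  SecondCoder = c("NA", "Chat", "Essay"),
  ThirdCoder = c("Essay", "Chat", "Chat"),
  FourthCoder = c("Essay", "NA", "Chat")
)

# Restrict to the coder columns so that Condition is not counted.
coder_cols <- grep("Coder$", names(fakeData), value = TRUE)
fakeData$MostRecentModalCode <- apply(fakeData[coder_cols], 1, recent_mode)
fakeData$MostRecentModalCode
# "Essay" "Chat" "Chat"
```

Selecting the columns by name rather than with fakeData[-1] keeps Condition out of the count, which matters for the 2-2 tie in row 3.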

Upvotes: 0

Andri Signorell

Reputation: 1309

What about:

apply(fakeData[,-1], 1, DescTools::Mode, na.rm=TRUE)

Upvotes: 0

Nick Byrd

Reputation: 161

@akrun's answer pointed me to another post that had a custom Mode function buried in the answers that fit my needs. I've renamed it ModeC, adapted from Mode in @DanHoughton's answer (https://stackoverflow.com/a/53290748/1701844).

ModeC <- function(x) {
  if ( length(x) <= 2 ) return(x[1])
  if ( anyNA(x) ) x = x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

For reasons I do not understand, it fails to ignore NAs on the fakeData (whether it's a data.table or a data.frame, and even when the NAs are not just "NA" strings), but it correctly ignores NAs when determining the mode in my actual data. So I am posting it here in case it works for others.
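
One likely contributor, at least for fakeData as defined in the question: the missing codes there are the quoted string "NA", which is.na() and anyNA() treat as ordinary text. A small sketch of the difference, using a made-up vector x that mirrors row 1 of the coder columns:

```r
# Row 1 of the coder columns, with missing codes stored as the string "NA".
x <- c("NA", "NA", "Essay", "Essay")

before <- anyNA(x)   # FALSE: "NA" in quotes is just a string
x[x == "NA"] <- NA   # recode the placeholder as a real missing value
after <- anyNA(x)    # TRUE
kept <- x[!is.na(x)] # only the two "Essay" codes remain
```

This does not explain the behaviour with real NAs; it only shows why the NA filter in ModeC has nothing to remove when the placeholder is a string.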

Upvotes: 1

akrun

Reputation: 886938

We can use the Mode function from here

> Mode <- function(x) {
+   ux <- unique(x)
+   ux[which.max(tabulate(match(x, ux)))]
+ }
> 
> apply(fakeData[-1], 1, Mode)
[1] "Essay" "Chat"  "Chat"
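
Note that which.max() returns the first maximum, so on a tie this Mode picks whichever value appears first in x. The output above includes the Condition column (fakeData[-1] drops only id), so no row is actually tied there. If only the coder columns are passed, reversing each row first should make ties go to the most recent coder instead. A minimal sketch with a made-up tied row called codes:

```r
# Mode as defined above.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# One row with a 2-2 tie, in coding order (first coder to fourth coder).
codes <- c("Essay", "Essay", "Chat", "Chat")

first_wins <- Mode(codes)       # tie goes to the value seen first: "Essay"
last_wins  <- Mode(rev(codes))  # reversed, tie goes to the value coded last: "Chat"
```

Missing codes would still need to be dropped before calling Mode, since it counts NA (or the string "NA") like any other value.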

Upvotes: 3
