zwol

Reputation: 140669

Deduplicate levels of a factor

Suppose I have this object, which is the dput() form of an invalid factor (for instance, printing it will complain about the duplicate level 3):

x <- structure(c(1L, 2L, 3L, 4L), .Label = c("A", "B", "A", "C"),
               class = "factor")

What is the best way, using only base R, to convert it to the valid factor below?

structure(c(1L, 2L, 1L, 3L), .Label = c("A", "B", "C"), class = "factor")

I managed to come up with:

factor(levels(x)[x])

but I'm not certain that this will keep working without warnings in future versions of R, and it's probably also quite inefficient (the real factor object that I'm trying to repair is enormous).

Upvotes: 4

Views: 91

Answers (1)

John Coleman

Reputation: 51998

Your method seems good, and fairly efficient. To experiment, I created a function to make such malformed factors:

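# Build a factor directly from integer codes and labels, skipping the
# validity checks that factor() would apply (so duplicate labels are allowed).
bad.factor <- function(nums, labs) {
  structure(nums, .Label = labs, class = "factor")
}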

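For a test case, build a malformed factor with a million elements and random uppercase-letter labels (gtools::chr converts character codes to letters; gtools and microbenchmark are used here only for the experiment, not for the repair itself):

x <- bad.factor(1:1000000, gtools::chr(runif(1000000, 65, 90)))

Then time your expression: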

microbenchmark::microbenchmark(factor(levels(x)[x]))

Typical output is:

 Unit: milliseconds
                 expr      min       lq     mean   median       uq      max neval
 factor(levels(x)[x]) 27.72593 32.98346 42.97813 34.11871 35.70919 105.3564   100
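
If the intermediate character vector created by levels(x)[x] ever becomes a concern, one possible alternative is to remap the integer codes directly. The following is only a sketch: dedupe_levels is a hypothetical helper name, not something from base R, and I have not benchmarked it against your expression. It differs from factor(levels(x)[x]) in two ways: it keeps levels in order of first appearance rather than sorting them, and it keeps levels that no element uses.

```r
# Hypothetical helper (not part of base R): collapse duplicate levels by
# remapping the integer codes instead of rebuilding the factor from strings.
dedupe_levels <- function(f) {
  lev   <- attr(f, "levels")       # raw labels, possibly with duplicates
  ulev  <- unique(lev)             # each label once, in order of first appearance
  remap <- match(lev, ulev)        # lookup table: old code -> new code
  structure(remap[as.integer(f)],  # translate every element's code
            levels = ulev, class = "factor")
}

# The malformed factor from the question
x <- structure(c(1L, 2L, 3L, 4L), .Label = c("A", "B", "A", "C"),
               class = "factor")
y <- dedupe_levels(x)
```

For the example in the question, where the unique labels are already in sorted order, y should match the target factor with codes 1, 2, 1, 3 and levels "A", "B", "C".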

Upvotes: 1
