Gina Zetkin

Reputation: 333

Replacing levels of multiple factors

I need to replace the levels of multiple factors in one data frame so that they are all unified. These are, for example, the levels of one of those factors:

> levels(workco[,5])
 [1] " "                              "1"                              "2"                             
 [4] "kóko"                          "kesätyö"                      "Kesätyö kokoaika"            
 [7] "koko"                           "kokop"                          "kokop."                        
[10] "Kokopäivä"                    "kokopäiväinen"                "Kokopäiväinen"               
[13] "kokopäiväinen / osa-aikainen" "kokopäivänen"                 "kokp"                          
[16] "kokp."                          "Kokp."                          "osa-aik"                       
[19] "Osa-aik / Kokopäiv."           "osa-aik."                       "Osa-aik."                      
[22] "osa-aikainen"                   "Osa-aikainen"                   "osa-aikainen/kokopäiväinen"  
[25] "Osa/kokoaikainen"               "Osap."                  

Let's say I have 12 columns that are all factors, and they have different level names that mean the same thing but are written differently. As you can see from the example, many of the level names share the same letters: koko, kok, kokop... I want to end up with three unified levels: kokop, osa and kes. The levels named 1 and 2 should also be recoded into kokop and osa, respectively.

So far nothing I have tried works, and I am afraid it's because I am thinking about it in a more complicated way than necessary. I have tried loops using the adist() function and also grep() separately, but I get errors. For example:

code <- c("kok","osa","ma","kes",1,2," ")
list.names <- c("1", "2", "3", "4", "5", "6","7","8","9","10","11","12")
mylist <- vector("list", length(list.names))
names(mylist) <- list.names
D <- mylist
index <- mylist

for (i in ncol(workco2)){                            
  D[[i]] <- adist(workco2[,i],code,ignore.case=TRUE)
  index[[i]] <- lapply(D[[i]],which.min)
  workco2[,i] <- data.frame(code[index[[i]]])
}

And this error message:

Error in code[index[[i]]] : invalid subscript type 'list'

Could you be so kind as to give me a hint on how you would solve this? It is probably much simpler than I think =/ Thanks in advance!

Upvotes: 1

Views: 1024

Answers (2)

Ruthger Righart

Reputation: 4921

My guess is that you need a combination of grep and replace. This may speed up changing levels that share syllables ("ko", "kok").

Data example

code <- as.factor(c("kok","osa","ma","kes", "koko", "osa-aikainen", "osa/kes"))

Add level

levels(code) <- c(levels(code), "kokop")

Replace all instances containing "kok" with "kokop"

new.code <- replace(code, grep("kok", code), "kokop")

Replace all instances containing "osa/kes" with "kes"

new.code <- replace(new.code, grep("osa/kes", new.code), "kes")

Use a shorter pattern, e.g. "ko", to catch every level that shares that syllable ("ko", "kok")

new.code <- replace(new.code, grep("ko", new.code), "kokop")
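
The steps above can be chained so that each replacement builds on the previous one. Below is a minimal sketch under some assumptions: the example vector is made up (it is not the asker's real column), the patterns "kok" and "^osa" are only guesses at what should count as kokop and osa, and droplevels() is added at the end to discard the old, now unused levels.

```r
# Hypothetical example data, not the asker's real column
code <- factor(c("kok", "osa", "1", "kes", "koko", "osa-aikainen", "2"))

# Add any target level that does not exist yet (here only "kokop" is new)
levels(code) <- union(levels(code), c("kokop", "osa", "kes"))

# Start from a copy and keep updating it, so no replacement is lost
new.code <- code
new.code <- replace(new.code, grep("kok", new.code), "kokop")
new.code <- replace(new.code, grep("^osa", new.code), "osa")

# The numeric codes 1 and 2 are matched exactly rather than by pattern
new.code <- replace(new.code, which(new.code == "1"), "kokop")
new.code <- replace(new.code, which(new.code == "2"), "osa")

# Remove the levels that are no longer used
new.code <- droplevels(new.code)
levels(new.code)
```

Without the droplevels() call the factor would still list the old spellings as empty levels, which tends to show up later in tables and plots.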

Upvotes: 0

Roman Luštrik

Reputation: 70623

I usually merge factors as demonstrated in the example below. I subset levels that correspond to my criterion (... %in% c(...)) and overwrite them with a new level.

set.seed(357)
xy <- data.frame(name = sample(letters[1:4], size = 20, replace = TRUE), value = runif(20), stringsAsFactors = TRUE)
xy$name
 [1] a a b a c b d c d d c c b a c a b d c b
Levels: a b c d
levels(xy$name)[levels(xy$name) %in% c("a", "b")] <- "a-b"
levels(xy$name)[levels(xy$name) %in% c("c", "d")] <- "c-d"
xy$name
 [1] a-b a-b a-b a-b c-d a-b c-d c-d c-d c-d c-d c-d a-b a-b c-d a-b a-b c-d c-d a-b
Levels: a-b c-d
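
The same idea can probably be extended to the question's case by picking the levels with a pattern instead of %in% and applying it to every factor column. This is only a sketch under stated assumptions: the data frame below is a small made-up stand-in for workco, unify_levels() is a hypothetical helper name, and the patterns are anchored at the start of the label, so a mixed label would be classified by its first word. The blank level " " is left untouched because the question does not say what it should become.

```r
# Hypothetical data frame standing in for the asker's `workco`
workco <- data.frame(
  a = c("koko", "Kokp.", "osa-aik", "1", " "),
  b = c("2", "Osap.", "kokop.", "kes", "Osa-aik."),
  stringsAsFactors = TRUE
)

# Hypothetical helper: rewrite the levels of one factor by pattern
unify_levels <- function(f) {
  lev <- levels(f)
  new <- lev
  new[grepl("^kes", lev, ignore.case = TRUE)] <- "kes"
  new[grepl("^kok", lev, ignore.case = TRUE) | lev == "1"] <- "kokop"
  new[grepl("^osa", lev, ignore.case = TRUE) | lev == "2"] <- "osa"
  levels(f) <- new  # assigning duplicated names merges those levels
  f
}

# Apply the helper to every factor column and leave other columns alone
is_fac <- vapply(workco, is.factor, logical(1))
workco[is_fac] <- lapply(workco[is_fac], unify_levels)
```

Because only the levels are rewritten, not the individual values, this stays cheap even for long columns.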

Upvotes: 1
