Reputation: 921

How to correct a mistake made in the levels of a factor variable?

Let's say I have this data frame:

d = data.frame(x = c("1","1 2", "1 3", "2 3", "3", "4"))
d

and its variable x is a factor:

d$x = as.factor(d$x)

However, I discovered an error in three of the levels that I wrote.

So I want to replace these values and their levels as follows:

I want to replace "1 2" with "1"

I want to replace "1 3" with "1"

I want to replace "2 3" with "2"

levels(d$x)

So I want to correct it. When I use the following method:

d$x[which(d$x == "1 2")] <- "1"
d$x[which(d$x == "1 3")] <- "1"
d$x[which(d$x == "2 3")] <- "2"

It creates the following levels:

1 1 1 2 3 4

What I want is these levels:

1 2 3 4

What should I do to handle this problem? Thanks.

Upvotes: 0

Answers (5)

Ottie

Reputation: 1030

Copying from my answer to a recent question:

Under the hood, a factor is an integer vector with labels (levels). You can rename the labels directly instead of editing individual values; giving several levels the same label merges them.

d = data.frame(x = factor(c("1","1 2", "1 3", "2 3", "3", "4")))
levels(d$x)
[1] "1"   "1 2" "1 3" "2 3" "3"   "4" 

levels(d$x) <- c(1, 1, 1, 2, 3, 4)
levels(d$x)
[1] "1" "2" "3" "4"

d$x
[1] 1 1 1 2 3 4
Levels: 1 2 3 4
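
To make the "integer vector with labels" structure visible, here is a minimal base R sketch. The names `f`, `codes_before` and `codes_after` are only illustrative; it prints the integer codes before and after relabelling the levels.

```r
# Illustrative factor with the same values as in the question
f <- factor(c("1", "1 2", "1 3", "2 3", "3", "4"))

# Each element is stored as an integer index into levels(f)
codes_before <- as.integer(f)
codes_before
levels(f)

# Assigning duplicated labels merges the corresponding levels,
# and the integer codes are remapped to the merged level set
levels(f) <- c("1", "1", "1", "2", "3", "4")
codes_after <- as.integer(f)
codes_after
levels(f)
```

With six distinct levels the codes start out as 1 to 6; after the merge they should point into the four remaining levels.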

If you have more levels and don't want to risk a manual assignment, you can create a dictionary of replacement values:

d = data.frame(x = factor(c("1","1 2", "1 3", "2 3", "3", "4")))
dict <- setNames(
    gsub(' .$', '', levels(d$x)), # remove a space followed by a single trailing character
    levels(d$x)
)
dict
  1 1 2 1 3 2 3   3   4 
"1" "1" "1" "2" "3" "4"

You can then use the dictionary to replace the existing level labels with the new ones:

levels(d$x) <- dict[levels(d$x)]
d$x
[1] 1 1 1 2 3 4
Levels: 1 2 3 4
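
If a regular expression feels too implicit, base R also accepts a named list on the right-hand side of `levels<-`. The sketch below spells out the mapping by hand, using an illustrative factor `f`. In this form, any old level that is left out becomes `NA`, which is why "3" and "4" are listed as well.

```r
f <- factor(c("1", "1 2", "1 3", "2 3", "3", "4"))

# Named list: each name is a new level, each element lists
# the old levels that should be absorbed into it.
# Old levels that are not mentioned would turn into NA.
levels(f) <- list(
  "1" = c("1", "1 2", "1 3"),
  "2" = "2 3",
  "3" = "3",
  "4" = "4"
)

f
levels(f)
```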

Upvotes: 1

Joris C.

Reputation: 6244

There is also a dedicated function recode() in dplyr for this purpose:

library(dplyr)

## initial factor
x <- factor(c("1","1 2", "1 3", "2 3", "3", "4"))
levels(x)
#> [1] "1"   "1 2" "1 3" "2 3" "3"   "4"

## edited factor
recode(x, "1 2" = "1", "1 3" = "1", "2 3" = "2")
#> [1] 1 1 1 2 3 4
#> Levels: 1 2 3 4
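
Note that recent dplyr releases (1.1.0 and later) document recode() as superseded. For factors, a possible alternative is fct_recode() from forcats, sketched below on the assumption that forcats is installed. The argument direction is reversed: fct_recode() expects new = "old" pairs, whereas recode() uses "old" = "new".

```r
library(forcats)

x <- factor(c("1", "1 2", "1 3", "2 3", "3", "4"))

# fct_recode() takes new_level = "old_level" pairs;
# repeating a new level name merges several old levels into it
x2 <- fct_recode(x, "1" = "1 2", "1" = "1 3", "2" = "2 3")

x2
levels(x2)
```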

P.S.: you should not edit your question in such a way that it invalidates (previously valid) answers.

Upvotes: 1

s_baldur

Reputation: 33743

Another option is to convert back to character, modify the strings, and rebuild the factor:

d$x <- as.character(d$x)
d$x <- factor(sub(" .+", "", d$x))

d$x
# [1] 1 1 1 2 3 4
# Levels: 1 2 3 4
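
The same round trip also works with the explicit replacements from the question. Assigning a value that is not an existing level into a factor yields NA with a warning, whereas a character vector accepts any string. Here is a minimal sketch, where `x_chr` is just an illustrative helper name:

```r
d <- data.frame(x = factor(c("1", "1 2", "1 3", "2 3", "3", "4")))

# Work on plain strings, so any replacement value is allowed
x_chr <- as.character(d$x)
x_chr[x_chr == "1 2"] <- "1"
x_chr[x_chr == "1 3"] <- "1"
x_chr[x_chr == "2 3"] <- "2"

# Rebuilding the factor derives the levels from the values that remain
d$x <- factor(x_chr)

d$x
levels(d$x)
```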

Upvotes: 2

Maël

Reputation: 52389

You can use fct_collapse:

library(dplyr)
library(forcats)
d %>% 
  mutate(x = fct_collapse(x,
                          "1" = c("1", "1 2", "1 3"),
                          "2" = "2 3"))  # "2" is not an existing level, so only "2 3" is listed
  x
1 1
2 1
3 1
4 2
5 3
6 4

Upvotes: 1

AndS.

Reputation: 8120

How about this? Split the text on the space, then unnest the resulting lists to long format. This works even when many values are affected, and it assumes that a space marks the error, as in your example.

library(tidyverse)

d <-  data.frame(x = c("1","2", "3 4", "5", "6"))

d |>
  mutate(x = str_split(x, pattern = "\\s")) |>
  unnest_longer(x)
#> # A tibble: 6 x 1
#>   x    
#>   <chr>
#> 1 1    
#> 2 2    
#> 3 3    
#> 4 4    
#> 5 5    
#> 6 6

Edit based on comments: Here are two methods. One with tidyverse and one using base R.

library(tidyverse)
  
d <-  data.frame(x = c("1","2", "3 4", "5", "6"))

# tidyverse: remove the trailing " 4" (the result is printed; d itself is not modified)
d |>
  mutate(x = str_remove(x, "\\s4$"))
#>   x
#> 1 1
#> 2 2
#> 3 3
#> 4 5
#> 5 6

# base R: overwrite the offending value in place
d$x[which(d$x == "3 4")] <- "3"
d
#>   x
#> 1 1
#> 2 2
#> 3 3
#> 4 5
#> 5 6

Another edit based on more info:

d = data.frame(x = c("1","1 2", "1 3", "2 3", "3", "4"))

d$x <- as.factor(gsub("(\\d+)\\s\\d+$", "\\1", d$x))

d
#>   x
#> 1 1
#> 2 1
#> 3 1
#> 4 2
#> 5 3
#> 6 4

levels(d$x)
#> [1] "1" "2" "3" "4"
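
Before applying a pattern like this to real data, it can help to preview which levels it would change. A minimal base R sketch, reusing the regular expression above on an illustrative factor `x` (the names `old`, `new` and `mapping` are assumptions for this example):

```r
x <- factor(c("1", "1 2", "1 3", "2 3", "3", "4"))

# Apply the pattern to the levels only, not to every observation
old <- levels(x)
new <- gsub("(\\d+)\\s\\d+$", "\\1", old)

# Preview: show only the levels that the pattern would rewrite
mapping <- data.frame(old = old, new = new)
mapping[mapping$old != mapping$new, ]

# Once the preview looks right, relabel; duplicated labels are merged
levels(x) <- new
levels(x)
```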

Upvotes: 1
