An116

Reputation: 911

How to correct a mistake made in the levels of a factor variable?

Let's say I have this data frame:

d = data.frame(x = c("1","1 2", "1 3", "2 3", "3", "4"))
d

and the variable x is stored as a factor:

d$x = as.factor(d$x)

However, I discovered an error in three of the levels that I wrote.

So I want to replace these values and their levels as follows:

I want to replace "1 2" with "1"

I want to replace "1 3" with "1"

I want to replace "2 3" with "2"

levels(d$x)

I tried to correct it using the following method:

d$x[which(d$x == "1 2")] <- "1"
d$x[which(d$x == "1 3")] <- "1"
d$x[which(d$x == "2 3")] <- "2"

It creates levels as follows:

1 1 1 2 3 4

What I want is the following levels:

1 2 3 4

What should I do to handle this problem? Thanks.

Upvotes: 0

Views: 495

Answers (5)

Ottie

Reputation: 1030

Copying from my answer to a recent question:

Under the hood, a factor array is an integer array with labels (levels). You can rename the labels alone without touching the underlying array.

d = data.frame(x = factor(c("1","1 2", "1 3", "2 3", "3", "4")))
levels(d$x)
[1] "1"   "1 2" "1 3" "2 3" "3"   "4" 

levels(d$x) <- c(1, 1, 1, 2, 3, 4)
levels(d$x)
[1] "1" "2" "3" "4"

d$x
[1] 1 1 1 2 3 4
Levels: 1 2 3 4

If you have more levels and don't want to risk a manual assignment, you can create a dictionary of replacement values:

d = data.frame(x = factor(c("1","1 2", "1 3", "2 3", "3", "4")))
dict <- setNames(
    gsub(' .$', '', levels(d$x)), # remove a trailing space plus the single character after it
    levels(d$x)
)
dict
  1 1 2 1 3 2 3   3   4 
"1" "1" "1" "2" "3" "4" 

You can then use the dictionary to replace the existing level labels with the new ones:

levels(d$x) <- dict[levels(d$x)]
d$x
[1] 1 1 1 2 3 4
Levels: 1 2 3 4
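
To make the "integer array with labels" point concrete, here is a minimal sketch in base R. The names `x`, `y`, `z`, `codes` and `lv` are only for illustration. It inspects the underlying codes and shows why element-wise assignment, as attempted in the question, does not by itself produce the desired levels.

```r
x <- factor(c("1", "1 2", "1 3", "2 3", "3", "4"))

# A factor stores integer codes that index into its levels
codes <- as.integer(x)
lv <- levels(x)

# Assigning a string that is not an existing level yields NA
# (R also emits a warning, silenced here to keep the output clean)
y <- x
suppressWarnings(y[y == "2 3"] <- "2")
is.na(y[4])

# Assigning an existing level works, but the old level stays until dropped
z <- x
z[z == "1 2"] <- "1"
"1 2" %in% levels(z)
"1 2" %in% levels(droplevels(z))
```

Renaming the levels, as above, sidesteps both problems because it never touches the codes.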

Upvotes: 1

Joris C.

Reputation: 6234

There is also a dedicated function recode() in dplyr for this purpose:

library(dplyr)

## initial factor
x <- factor(c("1","1 2", "1 3", "2 3", "3", "4"))
levels(x)
#> [1] "1"   "1 2" "1 3" "2 3" "3"   "4"

## edited factor
recode(x, "1 2" = "1", "1 3" = "1", "2 3" = "2")
#> [1] 1 1 1 2 3 4
#> Levels: 1 2 3 4

P.S.: you should not edit your question in such a way that it invalidates (previously valid) answers.

Upvotes: 1

s_baldur

Reputation: 33498

Another option is turning back to character while modifying:

d$x <- as.character(d$x)
d$x <- factor(sub(" .+", "", d$x))

d$x
# [1] 1 1 1 2 3 4
# Levels: 1 2 3 4
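
As a variation, the same regular expression can probably be applied to the level labels directly, which avoids the round trip through character. This is a sketch in base R on a standalone vector `x` (an illustrative name, not part of the answer above); it relies on `levels<-` merging levels that end up with the same label.

```r
x <- factor(c("1", "1 2", "1 3", "2 3", "3", "4"))

# Apply the regex to the level labels only; duplicate labels are merged
levels(x) <- sub(" .+", "", levels(x))

levels(x)
as.character(x)
```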

Upvotes: 2

Maël

Reputation: 51914

You can use fct_collapse:

library(dplyr)
library(forcats)
d %>% 
  mutate(x = fct_collapse(x, 
                          "1" = c("1", "1 2", "1 3"),
                          "2" = "2 3"))  # "2" is not an existing level, so only "2 3" is listed
  x
1 1
2 1
3 1
4 2
5 3
6 4
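
If you would rather not load forcats, a similar collapse can be sketched in base R with the list form of `levels<-`. This is an illustrative alternative on a standalone vector `x`, not part of the forcats API. Each list name is a new level and each value holds the old levels it absorbs.

```r
x <- factor(c("1", "1 2", "1 3", "2 3", "3", "4"))

# Every level to keep must be listed; unlisted levels would become NA
levels(x) <- list(
  "1" = c("1", "1 2", "1 3"),
  "2" = "2 3",
  "3" = "3",
  "4" = "4"
)

levels(x)
as.character(x)
```

Unlike `fct_collapse`, this form needs the unchanged levels ("3" and "4") spelled out explicitly.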

Upvotes: 1

AndS.

Reputation: 8110

How about this? Split the text on whitespace and then unnest the resulting lists into long format. This works even when many values are affected, and it assumes that a space marks the error, as in your example.

library(tidyverse)

d <-  data.frame(x = c("1","2", "3 4", "5", "6"))

d |>
  mutate(x = str_split(x, pattern = "\\s")) |>
  unnest_longer(x)
#> # A tibble: 6 x 1
#>   x    
#>   <chr>
#> 1 1    
#> 2 2    
#> 3 3    
#> 4 4    
#> 5 5    
#> 6 6

Edit based on comments: Here are two methods. One with tidyverse and one using base R.

library(tidyverse)
  
d <-  data.frame(x = c("1","2", "3 4", "5", "6"))

d |>
  mutate(x = str_remove(x, "\\s4$")) 
#>   x
#> 1 1
#> 2 2
#> 3 3
#> 4 5
#> 5 6

d$x[which(d$x == "3 4")] <- "3"
d
#>   x
#> 1 1
#> 2 2
#> 3 3
#> 4 5
#> 5 6

Another edit based on more info:

d = data.frame(x = c("1","1 2", "1 3", "2 3", "3", "4"))

d$x <- as.factor(gsub("(\\d+)\\s\\d+$", "\\1", d$x))

d
#>   x
#> 1 1
#> 2 1
#> 3 1
#> 4 2
#> 5 3
#> 6 4

levels(d$x)
#> [1] "1" "2" "3" "4"

Upvotes: 1
