Davide Bottoli
Davide Bottoli

Reputation: 91

Separate values from characters in R string

I have this problem. My dataset a contains one column badly formatted containing characters, letters and punctuation. I would like to separate column Unit_Wrong in two columns num and text.

Here is dataset a:

a <- data.frame(Measure = c(10000, 2000, 10000, 15000, 40000, 0), 
                Unit_Wrong = c("10L","25.5mL","30.5 mL","40OUNCES","3X", "NO_SIZE"), 
                stringsAsFactors = FALSE)

My expected outcome is b:

b <- data.frame(Measure = c(10000, 2000, 10000, 15000, 40000, 0), 
                Unit_Wrong = c("10L","25.5mL","30.5 mL","40OUNCES","3X", "NO_SIZE"), 
                text = c("L", "mL", "ml", "OUNCES", "X", "NO_SIZE"),
                num = c("10","25.5","30.5","40","3", ""),
                stringsAsFactors = FALSE) 

I tried with this but it doesn't work:

attempt <- a %>% 
  mutate(text = gsub("[[:digit:]]","", Unit_Wrong)) %>%
  mutate(num = str_replace_all(Unit_Wrong, text, ""))

Can you help?

Upvotes: 1

Views: 62

Answers (2)

Jilber Urbina
Jilber Urbina

Reputation: 61154

Here's an R base soluction using gsub

> text <- gsub("\\d*\\s*\\.*", "", a$Unit_Wrong)
> num <- as.numeric(gsub("\\s*[[A-Za-z]]*_*", "", a$Unit_Wrong))
> data.frame(a, text, num)
  Measure Unit_Wrong    text  num
1   10000        10L       L 10.0
2    2000     25.5mL      mL 25.5
3   10000    30.5 mL      mL 30.5
4   15000   40OUNCES  OUNCES 40.0
5   40000         3X       X  3.0
6       0    NO_SIZE NO_SIZE   NA

Upvotes: 1

Andre Elrico
Andre Elrico

Reputation: 11480

    a %>%
mutate(text = stringr::str_extract(Unit_Wrong,"[A-z]+$")) %>%
mutate(num  = stringr::str_extract(Unit_Wrong,"(\\d\\.?)+") %>% as.numeric)

Output:

  Measure Unit_Wrong    text  num
1      10        10L       L   10
2    2000     25.5mL      mL 25.5
3   10000    30.5 mL      mL 30.5
4      15   40OUNCES  OUNCES   40
5      40         3X       X    3
6       0    NO_SIZE NO_SIZE <NA>

Note:

If you have special characters for unit like "µ" etc. You need to add those in inside [A-z] like [A-zµ] and so on.

Upvotes: 3

Related Questions