Reputation: 91
I have this problem. My dataset a contains one column badly formatted containing characters, letters and punctuation. I would like to separate column Unit_Wrong in two columns num and text.
Here is dataset a:
a <- data.frame(Measure = c(10000, 2000, 10000, 15000, 40000, 0),
Unit_Wrong = c("10L","25.5mL","30.5 mL","40OUNCES","3X", "NO_SIZE"),
stringsAsFactors = FALSE)
My expected outcome is b:
b <- data.frame(Measure = c(10000, 2000, 10000, 15000, 40000, 0),
Unit_Wrong = c("10L","25.5mL","30.5 mL","40OUNCES","3X", "NO_SIZE"),
text = c("L", "mL", "ml", "OUNCES", "X", "NO_SIZE"),
num = c("10","25.5","30.5","40","3", ""),
stringsAsFactors = FALSE)
I tried with this but it doesn't work:
attempt <- a %>%
mutate(text = gsub("[[:digit:]]","", Unit_Wrong)) %>%
mutate(num = str_replace_all(Unit_Wrong, text, ""))
Can you help?
Upvotes: 1
Views: 62
Reputation: 61154
Here's an R base soluction using gsub
> text <- gsub("\\d*\\s*\\.*", "", a$Unit_Wrong)
> num <- as.numeric(gsub("\\s*[[A-Za-z]]*_*", "", a$Unit_Wrong))
> data.frame(a, text, num)
Measure Unit_Wrong text num
1 10000 10L L 10.0
2 2000 25.5mL mL 25.5
3 10000 30.5 mL mL 30.5
4 15000 40OUNCES OUNCES 40.0
5 40000 3X X 3.0
6 0 NO_SIZE NO_SIZE NA
Upvotes: 1
Reputation: 11480
a %>%
mutate(text = stringr::str_extract(Unit_Wrong,"[A-z]+$")) %>%
mutate(num = stringr::str_extract(Unit_Wrong,"(\\d\\.?)+") %>% as.numeric)
Output:
Measure Unit_Wrong text num
1 10 10L L 10
2 2000 25.5mL mL 25.5
3 10000 30.5 mL mL 30.5
4 15 40OUNCES OUNCES 40
5 40 3X X 3
6 0 NO_SIZE NO_SIZE <NA>
Note:
If you have special characters for unit like "µ" etc. You need to add those in
inside [A-z]
like [A-zµ]
and so on.
Upvotes: 3