Sophia L

Reputation: 21

How to replace cells if they contain part of a string in R

I have columns with different ratings from 1-5 with descriptors next to the number. The format is "number dash descriptor", e.g. "1 - very happy" or "5 - hungry". I want to replace these with just the number, but there are too many different descriptors to recode them all manually.

Because they all include a dash, I'm sure there must be a way to do something like replace all instances of cells that contain "1 -" with "1", but I can't seem to make anything simple work.

Any help is appreciated!

I can use str_contains to find cells that contain a dash, but can't make that work with replace in dplyr.

Upvotes: 1

Views: 196

Answers (3)

To extract numbers from text strings in R, I would use the {stringr} package.

First, let's reproduce your data in a simple dataframe:

library(dplyr)

data <- tibble("values" = c("1 - very happy", "5 - hungry", "3 - average"))

We can use str_extract from the {stringr} package to extract the first single character from a string, using the regex syntax for any character (.) at the beginning of the string (^):

# install.packages("stringr")  # run once if stringr is not installed yet
library(stringr)
data |> 
  mutate(numbers = stringr::str_extract(values, "^.") |> as.numeric())

But this won't work for numbers with more than one digit. Instead, we can use the regex for a run of one or more digits (\\d+) in str_extract, which extracts the first number in the string, wherever it appears:

data |> 
  mutate(numbers = stringr::str_extract(values, "\\d+") |> as.numeric())
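To make the difference between the two patterns concrete, here is a small sketch on a made-up vector (the values in x are hypothetical and not from the question). It also shows that str_extract only returns the first match, so digits inside the descriptor are ignored:

```r
library(stringr)

# Hypothetical ratings: one has two digits, one has digits in the descriptor
x <- c("10 - great", "3 - top 10 pick")

first_char <- str_extract(x, "^.")    # first character only: "1" "3"
first_num  <- str_extract(x, "\\d+")  # first run of digits:  "10" "3"
```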

This method allows us to also find any number that is before a dash symbol:

data |> 
  mutate(numbers = stringr::str_extract(values, "\\d+ -"),
         numbers = stringr::str_remove(numbers, " -") |> as.numeric()) 

Note that we have to remove the dash afterwards. This can be avoided with what regex calls a positive lookahead: match something only when it is followed by something else, without including that something else in the match. Here, that means extracting any number that comes directly before a space and a dash:

data |> 
  mutate(numbers = stringr::str_extract(values, "\\d+(?= -)") |> as.numeric())

Finally, other packages such as {readr} have functions that help with this kind of data cleaning task:

data |> 
  mutate(numbers = readr::parse_number(values))
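Since the question mentions several rating columns, the same idea can probably be applied to all of them at once with across(). This is only a sketch: the survey tibble and its column names (mood, appetite) are invented for illustration, and I'm assuming every cell follows the "number - descriptor" format.

```r
library(dplyr)
library(stringr)

# Hypothetical survey data with two rating columns (names are made up)
survey <- tibble(
  mood     = c("1 - very happy", "3 - average"),
  appetite = c("5 - hungry", "2 - a bit peckish")
)

# Replace each listed column with the leading number, converted to numeric
survey_clean <- survey |>
  mutate(across(c(mood, appetite),
                \(x) as.numeric(str_extract(x, "^\\d+(?= -)"))))
```

Cells that do not start with a number followed by " -" would come out as NA, which makes odd entries easy to spot.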

Upvotes: 1

Andre Wildberg

Reputation: 19088

An approach with sub, making sure only numbers are considered by using a capture group on the digit(s) (\\d+) at the beginning of the string (^).

library(dplyr)

df %>% 
  mutate(desc_new = as.numeric(sub("(^\\d+) - .*", "\\1", desc)))
# A tibble: 2 × 2
  desc           desc_new
  <chr>             <dbl>
1 1 - very happy        1
2 5 - hungry            5

Data

df <- structure(list(desc = c("1 - very happy", "5 - hungry")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -2L))
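One thing worth knowing about sub: when the pattern does not match, it returns the input unchanged, so as.numeric then yields NA with a warning. A small base R sketch on a made-up vector (the "no rating" entry is hypothetical):

```r
x <- c("1 - very happy", "no rating", "12 - starving")

# Entries that match are reduced to the captured digits;
# "no rating" does not match and is returned as-is
cleaned <- sub("(^\\d+) - .*", "\\1", x)

# Non-matching entries become NA (the coercion warning is silenced here)
nums <- suppressWarnings(as.numeric(cleaned))
```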

Upvotes: 1

Joy

Reputation: 119

Since we know that ' -' is present in every cell, why not just split the string on ' -' and take the first element, instead of using a regex that could potentially match something inside the "lot of different descriptors" mentioned?

> library(dplyr)
> library(stringr)
> df <- data.frame(number_dash_descriptor = c('1 - very happy', '4 - fafdf132321)(*&^%$#', '5 - hungry'))
> df
   number_dash_descriptor
1          1 - very happy
2 4 - fafdf132321)(*&^%$#
3              5 - hungry
> df %>% mutate(number_dash_descriptor = str_split(number_dash_descriptor, ' -') %>% sapply("[[", 1))
  number_dash_descriptor
1                      1
2                      4
3                      5
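Note that the split leaves the column as character. If numeric values are wanted, something like the following base R sketch should work; fixed = TRUE makes strsplit treat ' -' as a literal string rather than a regex. The vector x just reuses the example values above, and I'm assuming every element contains ' -'.

```r
x <- c("1 - very happy", "4 - fafdf132321)(*&^%$#", "5 - hungry")

# Split on the literal " -" and keep the first piece of each element
pieces <- strsplit(x, " -", fixed = TRUE)
first  <- vapply(pieces, `[[`, character(1), 1)

# Convert the leading piece to a number
ratings <- as.numeric(first)
```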

Upvotes: 2
