MsGISRocker
MsGISRocker

Reputation: 642

set na all values that starts with certain string in dplyr environment is.na(), na_if(), startsWith(), regex

I have data frame of several character and numerical variables. I want to set all values starting with 0/n( to NA . Theses values may be spread over several columns.

Example:

string <- c("asff", "1\n(", '0asfd', '0\n(asdf)')
num <- c(0,1,2,3)
df <- data.frame(string, num)
##
df
     string num
1      asff   0
2      1\n(   1
3     0asfd   2
4 0\n(asdf)   3

Desired output:

     string num
1      asff   NA
2      1\n(   1
3        NA   2
4        NA   3

The numerical variables are quickly set to NA by:

df%>%na_if(., 0)

But the character string doesn't seem to work as easy with na_if(), as na_if() does not seem to work with regex, does it? I tried things like: na_if(., "^0.") na_if(., "0.") na_if(., startsWith("0")), but ERROR ...

Maybe a conditionaldplyr::mutate(.=ifelse()) with a regex, but then I have these line breaks "\n)", which or not captured by "." in a regex. Or something with mutate(x = str_replace_all()).

I can do it column by column (df%>%mutate(x = ifelse(startsWith(x, "0"), NA,x )))), but that's too much hard coding. I can not apply it to all columns with mutate(across(colnames(), ~ifelse(startsWith(., "0"), NA, .))) as the numerical columns trigger an error.

Upvotes: 3

Views: 1967

Answers (1)

hugh-allan
hugh-allan

Reputation: 1370

If you're able to do it for one column using mutate, you should be able to do it for multiple columns using mutate_at() or mutate_all(), explained here: https://dplyr.tidyverse.org/reference/mutate_all.html

Without knowing what your data looks like, I think you'd want mutate_all() to modify all columns which have data which matches your condition.

In this example using the iris dataset, we replace all instances of 5 with the word five:

iris %>%
  tibble %>%
  mutate_all(function(x) str_replace(x, '5', 'five'))

# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   <chr>        <chr>       <chr>        <chr>       <chr>  
 1 five.1       3.five      1.4          0.2         setosa 
 2 4.9          3           1.4          0.2         setosa 
 3 4.7          3.2         1.3          0.2         setosa 
 4 4.6          3.1         1.five       0.2         setosa 
 5 five         3.6         1.4          0.2         setosa 
 6 five.4       3.9         1.7          0.4         setosa 
 7 4.6          3.4         1.4          0.3         setosa 
 8 five         3.4         1.five       0.2         setosa 
 9 4.4          2.9         1.4          0.2         setosa 
10 4.9          3.1         1.five       0.1         setosa 

Or like your condition, we can do this only when the string starts with 5, using ^5 regex language (^ indicates the start of the string, and 5 means a 5 at the beginning of the string).

iris %>%
  tibble %>%
  mutate_all(function(x) str_replace(x, '^5', 'five'))

# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   <chr>        <chr>       <chr>        <chr>       <chr>  
 1 five.1       3.5         1.4          0.2         setosa 
 2 4.9          3           1.4          0.2         setosa 
 3 4.7          3.2         1.3          0.2         setosa 
 4 4.6          3.1         1.5          0.2         setosa 
 5 five         3.6         1.4          0.2         setosa 
 6 five.4       3.9         1.7          0.4         setosa 
 7 4.6          3.4         1.4          0.3         setosa 
 8 five         3.4         1.5          0.2         setosa 
 9 4.4          2.9         1.4          0.2         setosa 
10 4.9          3.1         1.5          0.1         setosa 

Update To change the entire value, if it has a 5 at the start, you just need to change the str_replace function to a function which can change the entire value. In this case, we use an ifelse statement

iris %>%
  tibble %>%
  mutate_all(function(x) ifelse(str_detect(x, '^5'), 'had_five', x))

# A tibble: 150 x 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   <chr>              <dbl> <chr>              <dbl>   <int>
 1 had_five             3.5 1.4                  0.2       1
 2 4.9                  3   1.4                  0.2       1
 3 4.7                  3.2 1.3                  0.2       1
 4 4.6                  3.1 1.5                  0.2       1
 5 had_five             3.6 1.4                  0.2       1
 6 had_five             3.9 1.7                  0.4       1
 7 4.6                  3.4 1.4                  0.3       1
 8 had_five             3.4 1.5                  0.2       1
 9 4.4                  2.9 1.4                  0.2       1
10 4.9                  3.1 1.5                  0.1       1

Another update From your comments, it sounds like you want to apply the function to only character columns. To do this, you can substitute mutate_all(your_fun) for mutate_if(is.character, your_fun) - as described in the help documentation at the start of this answer (the same info page describes mutate_all, mutate_if and mutate_at).

Using your sample data as an example, we can set anything beginning with '0' to NA. I am confused by your example though - do you want to look for '0' or '0\n(' at the start of the string? Either way, this is how to do it:

# sample data
string <- c("asff", "1\n(", '0asfd', '0\n(asdf)')
num <- c(0,1,2,3)
df <- data.frame(string, num)


# for only a 0 at the start of the string
df %>% 
  mutate_if(is.character, function(x) ifelse(str_detect(x, '^0'), NA, x))

  string num
1   asff   0
2   1\n(   1
3   <NA>   2
4   <NA>   3


# for '0\n(' at the start of the string
df %>% 
  mutate_if(is.character, function(x) ifelse(str_detect(x, '^0\\n\\('), NA, x))

  string num
1   asff   0
2   1\n(   1
3  0asfd   2
4   <NA>   3

Upvotes: 4

Related Questions