Reputation: 75

How to use regex over entire dataframe in R

New R user here, so please go easy on me.

I have a data frame like:

   df = data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                     Confidence = c("ZLow", "High", "Med"),
                     Coverage = c("sub", "sub", "super"),
                     Aspect = c("ZPos", "ZUnd", "Neg"))

The actual file is much larger and was output by old hardware. For some reason, some entries have a "Z" put in front of them. How do I remove it from the entire dataset?

I tried df = gsub("Z", " ", df) but it just gives me nonsense. This darn thing!

[1] "1:3" "c(3, 1, 2)" "c(1, 1, 2)" "c(2, 3, 1)"

I looked here on Stack Overflow and tried the stringr package, but could not get that to work either. Does anyone know what to do?

Upvotes: 2

Answers (5)

Marcus Campbell

Reputation: 2796

Your approach with gsub() is not working because that function operates on vectors, not data frames. However, you can apply gsub() over each column of your data frame to get what you want:

df[] <- lapply(df, function (x) {gsub("Z", "", x)})
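
To see why the original call returned strings that look like code, here is a minimal sketch, assuming a cut-down version of the question's data with stringsAsFactors = FALSE set explicitly so that it behaves the same across R versions. The names broken and fixed are only for illustration.

```r
df <- data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                 Confidence = c("ZLow", "High", "Med"),
                 stringsAsFactors = FALSE)

# gsub() coerces a non-character input with as.character(); for a data
# frame that produces one deparsed string per column, not the cell values.
broken <- gsub("Z", "", df)
length(broken)          # one element per column
is.data.frame(broken)   # FALSE: the data frame structure is gone

# Applying gsub() column by column and assigning into fixed[] keeps the
# data frame structure and the column names.
fixed <- df
fixed[] <- lapply(fixed, function(x) gsub("Z", "", x))
```

The empty brackets in fixed[] are what preserve the data frame; assigning the lapply() result to fixed directly would leave you with a plain list.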

For a stringr solution (that also uses dplyr), try:

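library(tidyverse)

# across() replaces mutate_all() and funs(), which are deprecated in
# current dplyr (this form needs dplyr 1.0.0 or later)
df <- mutate(df, across(everything(), ~ str_replace_all(.x, "Z", "")))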

P.S. I recommend using df <- instead of df = in the future. Good luck!

EDIT: corrected typo - thanks @thelatemail

Upvotes: 4

smci

Reputation: 33940

You asked how to do this with the stringr (or stringi) package without getting the unwanted vector of indices you got:

> as.data.frame(apply(df, 2,
      function(col) stringr::str_replace_all(col, '^Z', '')))
> as.data.frame(apply(df, 2,
      function(col) stringi::stri_replace_first_regex(col, '^Z', '')))

   Mineral Confidence Coverage Aspect
1 feldspar        Low      sub    Pos
2  granite       High      sub    Und
3   Silica        Med    super    Neg

(The as.data.frame() call is needed to turn the matrix returned by apply() back into a data frame; see the related question "R: apply-like function that returns a data frame?".)
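
One caveat with the apply() route is worth a quick sketch: apply() converts the data frame to a matrix first, so every column comes back as text. The Depth column below is a made-up numeric column added only to show that effect, and base sub() stands in for the stringr call to keep the example self-contained.

```r
df <- data.frame(Mineral = c("Zfeldspar", "Zgranite"),
                 Depth = c(10.5, 22),   # hypothetical numeric column
                 stringsAsFactors = FALSE)

# apply() works on as.matrix(df), which is a character matrix here
m <- apply(df, 2, function(col) sub("^Z", "", col))
is.matrix(m)            # TRUE

# as.data.frame() restores the data frame, but Depth is no longer numeric
out <- as.data.frame(m)
is.numeric(out$Depth)   # FALSE
```

If the data has numeric columns that should stay numeric, the lapply() form over columns avoids this coercion.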

As for figuring out exactly how to call a str*_replace function over an entire data frame, I tried:

the entire df: stri_replace_first_regex(df, '^Z', '')
by rows: stri_replace_first_regex(df[1,], '^Z', '')
by columns: stri_replace_first_regex(df[,1], '^Z', '')

Only the last one works properly. Admittedly, this is a design flaw in the str*_replace functions: at minimum, they should recognize an invalid object and produce a useful error message instead of spewing out indices.
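
Until the replace functions check their input themselves, a small wrapper can do it. The sketch below is only an illustration: strip_leading_z is a hypothetical helper, not part of stringr or stringi, and it uses base sub() so that it runs without extra packages.

```r
# Hypothetical helper: fails loudly on anything that is not a data frame
# and only touches text columns.
strip_leading_z <- function(x) {
  if (!is.data.frame(x)) {
    stop("expected a data frame, got an object of class ", class(x)[1])
  }
  # Find character and factor columns; other columns are left untouched
  is_text <- vapply(x, function(col) is.character(col) || is.factor(col),
                    logical(1))
  # Note: factor columns come back as character vectors
  x[is_text] <- lapply(x[is_text],
                       function(col) sub("^Z", "", as.character(col)))
  x
}

df <- data.frame(Mineral = c("Zfeldspar", "Zgranite"),
                 Depth = c(10.5, 22),   # hypothetical numeric column
                 stringsAsFactors = FALSE)
cleaned <- strip_leading_z(df)
```

Calling strip_leading_z("Zfoo") stops with an error instead of silently returning something unexpected.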

Upvotes: 0

Wiktor Stribiżew

Reputation: 626748

You may use a simple ^Z regex in the following way:

df = data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                      Confidence = c("ZLow", "High", "Med"),
                      Coverage = c("sub", "sub", "super"),
                      Aspect = c("ZPos", "ZUnd", "Neg"))
df[] <- lapply(df, sub, pattern = '^Z', replacement = '')
> df
   Mineral Confidence Coverage Aspect
1 feldspar        Low      sub    Pos
2  granite       High      sub    Und
3   Silica        Med    super    Neg

The ^Z pattern matches the start of the string with the ^ anchor, and then Z is matched and removed using sub (as there is only one possible match in each string, there is no point in using gsub).
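
The anchor matters as soon as a real value contains a Z of its own. A small sketch, using made-up values that are not in the question's data:

```r
# Hypothetical values: a prefixed "Zircon", a prefixed "granite",
# and an unprefixed value with a Z in the middle
x <- c("ZZircon", "Zgranite", "AZurite")

unanchored <- gsub("Z", "", x)   # removes every Z, including legitimate ones
anchored   <- sub("^Z", "", x)   # removes only a single leading Z

unanchored   # "ircon"  "granite" "Aurite"
anchored     # "Zircon" "granite" "AZurite"
```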

Upvotes: 1

Lennyy

Reputation: 6132

You could do:

as.data.frame(sapply(df, function(x) {gsub("Z", "", x)}))

Upvotes: 0

Matias Andina

Reputation: 4220

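You are close. If you want to go with base gsub, replace one column at a time (data here stands for your data frame):

data$Mineral = gsub("Z", "", data$Mineral)

You can do this for all columns. Or use a combination of apply strategies (see other answers!)

PS. Naming your data frame data is not a good idea, since it shares its name with the base function data(). At least use my_data.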
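
Doing this for all columns by hand gets tedious, so a plain loop over the column names is one way to write it. This is a sketch on a cut-down copy of the question's data, with my_data as an illustrative name:

```r
my_data <- data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                      Aspect = c("ZPos", "ZUnd", "Neg"),
                      stringsAsFactors = FALSE)

# Overwrite each column in turn with its cleaned version.
# gsub("Z", ...) removes every Z, as in the single-column call above.
for (col in names(my_data)) {
  my_data[[col]] <- gsub("Z", "", my_data[[col]])
}
```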

Upvotes: 0
