user6677073

Replace strings/characters in data frame column

I have a data frame ("GO") in R with 2 columns, "term" and "gene". The "term" column is of type character and has entries like this:

GO_POSITIVE_REGULATION_OF_VIRAL_TRANSCRIPTION

GO_CARGO_RECEPTOR

GO_MATRIX ...

So every entry starts with GO_ and has _ between the words. I want to delete the GO_ and replace the other _ with spaces.

I tried to fix this with gsub:

GO$term <- gsub('GO', '', GO$term)
GO$term <- gsub('\\_', ' ', GO$term)

The problem is that for example GO_CARGO_RECEPTOR has become CAR RECEPTOR, but I need it to be CARGO RECEPTOR.

I don't know how to write the R code so that, as in this example, only the GO_ at the beginning of each string is removed and only the _ between the words are replaced.

Thanks for any help.

Upvotes: 0

Views: 1037

Answers (3)

Wiktor Stribiżew

Reputation: 626835

If you only want to replace _ with spaces in strings that start with a specific prefix, and drop that prefix as well, you can use gsub with a PCRE regex:

x <- c("GO_POSITIVE_REGULATION_OF_VIRAL_TRANSCRIPTION","POSITIVE_REGULATION_OF_VIRAL_TRANSCRIPTION")
gsub("(?:\\G(?!^)|^GO_)([^_]*)_", "\\1 ", x, perl=TRUE)
## => [1] "POSITIVE REGULATION OF VIRAL TRANSCRIPTION"
##    [2] "POSITIVE_REGULATION_OF_VIRAL_TRANSCRIPTION"


Regex details

  • (?:\G(?!^)|^GO_) - A non-capturing group that matches either the end of the preceding match (\G(?!^)) or (|) the GO_ substring (prefix) at the start of the string
  • ([^_]*) - Capturing group 1 (this value is referred to with \1 from the replacement pattern): any 0 or more chars other than _
  • _ - an underscore.
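
To see how the pattern behaves on the values from the question, here is a small sketch in base R. The vector name terms is an assumption made for illustration. One thing to watch: the pattern only matches when the first word is followed by an underscore, so a single-word entry such as GO_MATRIX is left untouched by this regex alone. The extra anchored sub() call below is one possible way to cover that case, not part of the pattern above.

```r
# Hypothetical vector holding the example values from the question
terms <- c("GO_POSITIVE_REGULATION_OF_VIRAL_TRANSCRIPTION",
           "GO_CARGO_RECEPTOR",
           "GO_MATRIX")

# Same pattern as above: matches are chained from ^GO_ via \G
cleaned <- gsub("(?:\\G(?!^)|^GO_)([^_]*)_", "\\1 ", terms, perl = TRUE)
print(cleaned)

# Single-word terms have no underscore after the first word, so the
# pattern never matches them; an anchored sub() removes the leftover prefix
cleaned <- sub("^GO_", "", cleaned)
print(cleaned)
```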

Upvotes: 0

bbiasi

Reputation: 1599

You can use dplyr::mutate together with a couple of base functions to modify the column inside the data frame.

library(dplyr)
GO <- GO %>% 
  dplyr::mutate(term = base::substring(term, 4), # remove GO_
                term = base::gsub("_", " ", term))
> GO
                                        term     gene
1 POSITIVE REGULATION OF VIRAL TRANSCRIPTION 0.507617
2                             CARGO RECEPTOR 0.991978
3                                     MATRIX 0.543001

  • Data
GO <- data.frame(term = c("GO_POSITIVE_REGULATION_OF_VIRAL_TRANSCRIPTION",
                          "GO_CARGO_RECEPTOR",
                          "GO_MATRIX"),
                 gene = runif(3))
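
Note that substring(term, 4) drops the first three characters whatever they are, so a value without the GO_ prefix would lose real letters. Below is a minimal sketch of a prefix-aware variant in base R. The data frame GO2 and its values are hypothetical, made up only to show an entry without the prefix.

```r
# Hypothetical data: the second term lacks the GO_ prefix
GO2 <- data.frame(term = c("GO_CARGO_RECEPTOR", "MATRIX_ORGANIZATION"),
                  gene = c(0.1, 0.2),
                  stringsAsFactors = FALSE)

# Strip the first three characters only where the prefix is present
has_prefix <- startsWith(GO2$term, "GO_")
GO2$term[has_prefix] <- substring(GO2$term[has_prefix], 4)

# Replace the remaining underscores literally (no regex needed)
GO2$term <- gsub("_", " ", GO2$term, fixed = TRUE)
print(GO2)
```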

Upvotes: 0

DSGym

Reputation: 2867

x <- "GO_CARGO_RECEPTOR"

gsub("_", " ", sub("^GO_", "", x))
[1] "CARGO RECEPTOR"

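Use sub with the anchored pattern "^GO_" so that only the prefix at the start of the string is removed, then use gsub to replace the remaining underscores with spaces. Your original gsub('GO', '', ...) removed every "GO", including the one inside CARGO.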
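
Applied to the whole column rather than a single string, the same idea might look like the sketch below. The data frame is rebuilt here so the snippet runs on its own; the gene values are placeholders, not data from the question.

```r
# Example data mirroring the question; gene values are placeholders
GO <- data.frame(term = c("GO_POSITIVE_REGULATION_OF_VIRAL_TRANSCRIPTION",
                          "GO_CARGO_RECEPTOR",
                          "GO_MATRIX"),
                 gene = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# sub() with ^ removes only the leading prefix, so the "GO" in CARGO survives;
# gsub() with fixed = TRUE then swaps each remaining underscore for a space
GO$term <- gsub("_", " ", sub("^GO_", "", GO$term), fixed = TRUE)
print(GO$term)
```

Both sub and gsub are vectorized, so no loop over the rows is needed.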

Upvotes: 1
