Reputation: 11

Remove characters from a string in a data frame

I have a data frame where column "ID" has values like these: 1234567_GSM00298873, 1238416_GSM90473673, 98377829

In other words, some rows have seven digits followed by "_" followed by letters and digits; other rows have only digits.

I want to remove the digits and the underscore preceding the letters, without affecting the rows that have only digits. I tried

dataframe$ID <- gsub("*_", "", dataframe$ID)

but that only removes the underscore. So I learned that * means zero or more. Is there a wildcard, and a repetition operator such that I can tell it to find the pattern "anything-seven-times-followed-by-_"? Thanks!

Upvotes: 0

Answers (4)

hwnd

Reputation: 70750

Your regular expression syntax is incorrect. You have nothing preceding your repetition operator.

dataframe$ID <- gsub('[0-9]+_', '', dataframe$ID)

This matches one or more digits (0 to 9) that are followed by an underscore.
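
As a quick check, the pattern can be run against the three sample values from the question. This is only a sketch: the vector name `ID` is an illustrative stand-in for the `dataframe$ID` column.

```r
# Sample values copied from the question; `ID` stands in for dataframe$ID
ID <- c("1234567_GSM00298873", "1238416_GSM90473673", "98377829")

# Remove one or more digits that are immediately followed by an underscore
cleaned <- gsub("[0-9]+_", "", ID)
print(cleaned)
# [1] "GSM00298873" "GSM90473673" "98377829"
```

If the digits should only be stripped when they start the string, anchoring the pattern as `"^[0-9]+_"` and using `sub()` instead of `gsub()` would be one way to narrow it.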

Upvotes: 1

lawyeR

Reputation: 7684

A different method. If a string has an underscore, return everything after the underscore; if not, return the string unchanged. This assumes the underscore is always the eighth character, as in the examples.

ID <- c("1234567_GSM00298873", "1238416_GSM90473673", "98377829")
ifelse(grepl("_", ID), substr(x = ID, 9, nchar(ID)), ID)
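
If the number of leading digits might vary, the same idea could be sketched without the hard-coded position by looking up where the underscore sits in each string. The names `pos` and `result` are illustrative, not from the answer above.

```r
ID <- c("1234567_GSM00298873", "1238416_GSM90473673", "98377829")

# Position of the first "_" in each string, or -1 when there is none.
# as.integer() drops the extra attributes that regexpr() attaches.
pos <- as.integer(regexpr("_", ID, fixed = TRUE))

# Keep everything after the underscore when one exists; otherwise keep the string
result <- ifelse(pos > 0, substr(ID, pos + 1, nchar(ID)), ID)
print(result)
```

On the sample values this should give the same result as the fixed-position version.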

Upvotes: 0

mpromonet

Reputation: 11963

The link http://marvin.cs.uidaho.edu/Handouts/regex.html could help you.

"[0-9]*_" will match zero or more digits followed by '_'
"[0-9]{7}_" will match exactly 7 digits followed by '_'
".{7}_" will match any 7 characters followed by '_'
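
A rough way to compare the three patterns is to apply each with `sub()`. The sample values come from the question; the extra value `"12_GSM1"` is a hypothetical input added only to show where the patterns differ.

```r
ID <- c("1234567_GSM00298873", "1238416_GSM90473673", "98377829")

# On the question's data, all three patterns strip the same prefix
zero_or_more  <- sub("[0-9]*_", "", ID)
exactly_seven <- sub("[0-9]{7}_", "", ID)
any_seven     <- sub(".{7}_", "", ID)

# Hypothetical value with only two digits before the underscore
short_id <- "12_GSM1"
short_star  <- sub("[0-9]*_", "", short_id)    # prefix removed
short_seven <- sub("[0-9]{7}_", "", short_id)  # no match, left unchanged

print(zero_or_more)
print(short_star)
print(short_seven)
```

The `{7}` forms are stricter: they only act when exactly seven characters sit directly before the underscore, which is closer to what the question asked for.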

Upvotes: 0

Casimir et Hippolyte

Reputation: 89639

Something like this?:

 dataframe$ID <- gsub("[0-9]+_", "", dataframe$ID)

Upvotes: 0
