Reputation: 11
I have a data frame where column "ID" has values like these: 1234567_GSM00298873, 1238416_GSM90473673, 98377829
In other words, some rows have 7 digits followed by "_" followed by letters and digits; other rows have only digits.
I want to remove the digits and the underscore preceding the letters, without affecting the rows that have only digits. I tried
dataframe$ID <- gsub("*_", "", dataframe$ID)
but that only removes the underscore. I have since learned that * means "zero or more". Is there a wildcard and a repetition operator that would let me match the pattern "anything-seven-times-followed-by-_"? Thanks!
Upvotes: 0
Views: 3699
Reputation: 70722
Your regular expression syntax is incorrect. You have nothing preceding your repetition operator.
dataframe$ID <- gsub('[0-9]+_', '', dataframe$ID)
This matches any character from 0 to 9 (1 or more times) that is followed by an underscore.
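To check the pattern against the sample values from the question, here is a minimal sketch. The vector ID stands in for dataframe$ID, and the stricter anchored variant is an assumption added for illustration, not part of the answer above.

```r
# Sample values copied from the question
ID <- c("1234567_GSM00298873", "1238416_GSM90473673", "98377829")

# Remove one or more digits followed by an underscore
cleaned <- gsub("[0-9]+_", "", ID)
print(cleaned)

# Stricter variant (assumption): anchor at the start of the string
# and require exactly seven digits before the underscore
strict <- sub("^[0-9]{7}_", "", ID)
print(strict)
```

Both calls leave the digits-only value untouched because it contains no underscore, so there is nothing for the pattern to match.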
Upvotes: 1
Reputation: 7654
A different method: if a string contains an underscore, return the part after the underscore; if not, return the string unchanged. This relies on the prefix always being seven digits, so the remainder starts at character 9.
ID <- c("1234567_GSM00298873", "1238416_GSM90473673", "98377829")
ifelse(grepl("_", ID), substr(x = ID, 9, nchar(ID)), ID)
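If the number of leading digits could vary, the hard-coded start position 9 would no longer be right. A possible variation, sketched here under that assumption, looks up the underscore position with regexpr() instead; the names pos and result are just illustrative.

```r
ID <- c("1234567_GSM00298873", "1238416_GSM90473673", "98377829")

# regexpr() gives the position of the first match, or -1 when there is none;
# as.integer() drops the extra attributes it attaches
pos <- as.integer(regexpr("_", ID, fixed = TRUE))

# Keep everything after the underscore when present; otherwise keep the string
result <- ifelse(pos > 0, substr(ID, pos + 1, nchar(ID)), ID)
print(result)
```

With fixed = TRUE the underscore is treated as a literal string rather than a regular expression, which is enough here.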
Upvotes: 0
Reputation: 11942
The link http://marvin.cs.uidaho.edu/Handouts/regex.html could help you.
"[0-9]*_" will match zero or more digits followed by '_'
"[0-9]{7}_" will match 7 digits followed by '_'
".{7}_" will match any 7 characters followed by '_'
Upvotes: 0
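As a rough comparison, the three patterns can be tried with sub() on the sample values from the question. The value "12_AB" is a made-up example, used only to show where the patterns differ.

```r
ID <- c("1234567_GSM00298873", "1238416_GSM90473673", "98377829")

patterns <- c("[0-9]*_", "[0-9]{7}_", ".{7}_")

# Apply each pattern to the sample IDs; all three agree on this data
results <- lapply(patterns, function(p) sub(p, "", ID))
names(results) <- patterns
print(results)

# Hypothetical value with only two digits before the underscore:
# only the "zero or more digits" pattern matches it
short <- vapply(patterns, function(p) sub(p, "", "12_AB"), character(1))
print(short)
```

On the question's data the choice makes no difference; the {7} forms are simply stricter about what they accept before the underscore.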
Reputation: 89547
Something like this?
dataframe$ID <- gsub("[0-9]+_", "", dataframe$ID)
Upvotes: 0