Reputation: 13025
I have a lot of strings, each of which tends to have the following format: Ab_Cd-001234.txt
I want to replace each one with 001234. How can I achieve this in R?
Upvotes: 33
Views: 77855
Reputation: 1118
I'm adding this answer because it works regardless of which non-numeric characters appear in the strings you want to clean up, and because the OP said that the strings tend to follow the format "Ab_Cd-001234.txt", which I take to mean that some variation is allowed.
Note that this answer takes all numeric characters from the string and keeps them together, so if the string were "4_Ab_Cd_001234.txt", your result would be "4001234".
To apply this to a column of a data frame:
df$clean_column<-gsub("[^0-9]", "", df$dirty_column)
This is very similar to the answer here: https://stackoverflow.com/a/52729957/9731173.
Essentially, my solution replaces every non-numeric character with "", while the answer I've linked to replaces every character that is not a digit, a hyphen (-), or a period (.).
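To make the caveat about concatenated digits concrete, here is a minimal sketch comparing the strip-everything approach with a pattern that keeps only the digits directly before the extension. The sample vector `files` and the second pattern are illustrative assumptions, not part of the original question.

```r
# Hypothetical sample inputs: the second one has a stray leading digit.
files <- c("Ab_Cd-001234.txt", "4_Ab_Cd_001234.txt")

# Remove every non-digit character; all remaining digits are concatenated.
strip_all <- gsub("[^0-9]", "", files)

# Alternative: capture only the run of digits right before the extension.
# perl = TRUE is used so the lazy quantifier .*? is handled by PCRE.
before_ext <- sub(".*?([0-9]+)\\.[^.]*$", "\\1", files, perl = TRUE)

print(strip_all)
print(before_ext)
```

Which of the two is appropriate depends on how much the real file names vary.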
Upvotes: 1
Reputation: 269556
gsub. Remove the prefix and suffix:
gsub(".*-|\\.txt$", "", x)
tools package. Use file_path_sans_ext from tools to remove the extension, and then use sub to remove the prefix:
library(tools)
sub(".*-", "", file_path_sans_ext(x))
strapplyc. Extract the digits after the - and before the dot. See the gsubfn home page for more info:
library(gsubfn)
strapplyc(x, "-(\\d+)\\.", simplify = TRUE)
Note that if a numeric result is desired, we could use strapply rather than strapplyc, like this:
strapply(x, "-(\\d+)\\.", as.numeric, simplify = TRUE)
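One thing to keep in mind when converting to numeric is that leading zeros are dropped. The sketch below shows this using only base R and the tools package; the second file name and the fixed width of 6 digits are assumptions made for illustration.

```r
# Second element is a hypothetical file name added for illustration.
x <- c("Ab_Cd-001234.txt", "Ab_Cd-000042.txt")

# Character result: leading zeros are preserved.
ids_chr <- sub(".*-", "", tools::file_path_sans_ext(x))

# Numeric result: leading zeros are lost.
ids_num <- as.numeric(ids_chr)

# Pad back to a fixed width (assumed to be 6 digits here) if needed.
ids_padded <- sprintf("%06d", as.integer(ids_num))

print(ids_chr)
print(ids_num)
print(ids_padded)
```

If the zero-padded form is what you need downstream, keeping the character result is the simpler choice.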
Upvotes: 2
Reputation: 42283
The stringr package has lots of handy shortcuts for this kind of work:
# input data following @agstudy
data <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
# load library
library(stringr)
# prepare regular expression
regexp <- "[[:digit:]]+"
# process string
str_extract(data, regexp)
Which gives the desired result:
[1] "001234" "001234"
To explain the regexp a little:
[[:digit:]] is any digit from 0 to 9
+ means the preceding item (in this case, a digit) will be matched one or more times
This page is also very useful for this kind of string processing: http://en.wikibooks.org/wiki/R_Programming/Text_Processing
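As a small illustration of the two pattern pieces explained above, the following base R sketch tests a few made-up strings (the `candidates` vector is an assumption) against `[[:digit:]]+` and, for contrast, `[[:digit:]]*`.

```r
# Hypothetical test strings, including an empty one.
candidates <- c("001234", "7", "", "12a")

# Anchored with ^ and $ so the whole string must be one or more digits.
one_or_more <- grepl("^[[:digit:]]+$", candidates)

# With * instead of +, the empty string also matches (zero or more digits).
zero_or_more <- grepl("^[[:digit:]]*$", candidates)

print(one_or_more)
print(zero_or_more)
```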
Upvotes: 41
Reputation: 109864
You could use genXtract from the qdap package. It takes a left character string and a right character string and extracts the elements between them.
library(qdap)
genXtract("Ab_Cd-001234.txt", "-", ".txt")
Though I much prefer agstudy's answer.
EDIT Extending answer to match agstudy's:
x <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
genXtract(x, "-", ".txt")
# $`- : .txt1`
# [1] "001234"
#
# $`- : .txt2`
# [1] "001234"
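If installing qdap is not an option, the same left/right idea can be sketched in base R. The helper name `extract_between` is hypothetical, and this is only a rough approximation of the behaviour described above (first occurrence of each delimiter, matched literally), not a reimplementation of genXtract.

```r
# Hypothetical helper: return the text between the first `left` delimiter
# and the first `right` delimiter that follows it, or NA if either is missing.
extract_between <- function(x, left, right) {
  # fixed = TRUE treats the delimiters as literal strings, not regexes.
  start <- as.integer(regexpr(left, x, fixed = TRUE))
  # Everything after the left delimiter.
  rest <- substring(x, start + nchar(left))
  end <- as.integer(regexpr(right, rest, fixed = TRUE))
  out <- substring(rest, 1, end - 1)
  # Mark inputs where a delimiter was not found.
  out[start == -1 | end == -1] <- NA_character_
  out
}

between <- extract_between(c("Ab_Cd-001234.txt", "nodash.txt"), "-", ".txt")
print(between)
```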
Upvotes: 4
Reputation: 121568
Using gsub or sub, you can do this:
gsub('.*-([0-9]+).*','\\1','Ab_Cd-001234.txt')
"001234"
You can also use gregexpr with regmatches, which returns a list with one element per input string:
m <- gregexpr('[0-9]+','Ab_Cd-001234.txt')
regmatches('Ab_Cd-001234.txt',m)
[[1]]
[1] "001234"
EDIT: Both methods are vectorized and work on a vector of strings.
x <- c('Ab_Cd-001234.txt','Ab_Cd-001234.txt')
sub('.*-([0-9]+).*','\\1',x)
"001234" "001234"
m <- gregexpr('[0-9]+',x)
regmatches(x,m)
[[1]]
[1] "001234"
[[2]]
[1] "001234"
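Since the question says the strings only tend to follow the format, it may help to know that sub returns its input unchanged when the pattern does not match. The sketch below shows one way to guard against that; the non-matching name "README.md" is a made-up example.

```r
# Second element is a hypothetical string that does not follow the format.
x <- c("Ab_Cd-001234.txt", "README.md")
pattern <- ".*-([0-9]+).*"

# Without a guard, the non-matching string is passed through untouched.
naive <- sub(pattern, "\\1", x)

# With a guard, non-matching strings become NA instead.
guarded <- ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)

print(naive)
print(guarded)
```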
Upvotes: 30