Reputation: 10626
I have a character vector like this:
years<-c("20 years old", "1 years old")
I would like to extract only the numbers from this vector. The expected output is a numeric vector:
c(20, 1)
How do I go about doing this?
Upvotes: 171
Views: 293531
Reputation: 5204
Slight variation on some other very good answers:
years <- c("20 years old", "1 years old")
as.numeric(gsub("[^0-9]", "", years))
#> [1] 20 1
Created on 2023-07-24 with reprex v2.0.2
Here the ^ at the start of the bracket expression negates the character class, so gsub() removes every character that is not a digit.
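One caveat worth keeping in mind: because this approach deletes non-digits rather than extracting a number, digits from separate numbers in the same string end up joined together. A small sketch with a made-up input (not taken from the question):

```r
# Hypothetical input containing two separate numbers
x <- "20 years old and 21"

# Removing every non-digit leaves the digits of both numbers side by side
stripped <- gsub("[^0-9]", "", x)   # "2021"
result <- as.numeric(stripped)
```

If a string may hold more than one number, an extraction-based approach is probably a better fit.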
Upvotes: 0
Reputation: 41
I am interested in this question as it applies to extracting values from the base::summary() function. Another option you might want to consider to extract values from a table is to build a function that takes any entry of your summary() table and transforms it into a useful number. For example, if you get:
(s <- summary(dataset))
sv_final_num_beneficiarios sv_pfam_rec sv_area_transf
Min. : 1.0 Min. :0.0000036 Min. :0.000004
1st Qu.: 67.5 1st Qu.:0.0286363 1st Qu.:0.010107
Median : 200.0 Median :0.0710803 Median :0.021865
Mean : 454.6 Mean :0.1140274 Mean :0.034802
3rd Qu.: 515.8 3rd Qu.:0.1527177 3rd Qu.:0.044234
Max. :17516.0 Max. :0.8217923 Max. :0.360924
you might want to extract the 1st Qu. for sv_pfam_rec, which means reading the 2nd row of the 2nd column. To get the single value as a number, I made a function:
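library(tidyr)     # separate()
library(tibble)    # as_tibble()
library(magrittr)  # %>%

# Split one summary() cell such as "Median :0.021865" at the colon
# and return the part after it as a number.
s_extract <- function(summary_entry) {
  separate(as_tibble(summary_entry),
           sep = ":",
           col = value,
           remove = FALSE,
           into = c("bad", "good"))[[3]] %>%
    as.numeric()
}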
You just have to feed it a summary entry, for example summary_entry = s[3,3], to obtain the Median of sv_area_transf.
It is worth noting that, because this function is based on separate(), it handles cases in which the name of the variable also contains numbers.
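For comparison, the same idea can be sketched in base R without separate(). This assumes each cell has the form "label:value", as in the summary output shown earlier; the `entry` string below is a hypothetical cell, not real output:

```r
# Hypothetical summary() cell, formatted like the entries shown earlier
entry <- "Median :0.021865  "

# Drop everything up to and including the first colon, then convert.
# as.numeric() tolerates the surrounding whitespace.
value <- as.numeric(sub("^[^:]*:", "", entry))
```

Because only the text before the first colon is discarded, digits in the label (for example "1st Qu.") do not interfere.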
Upvotes: 0
Reputation: 47300
Using the unglue package, we can do:
# install.packages("unglue")
library(unglue)
years<-c("20 years old", "1 years old")
unglue_vec(years, "{x} years old", convert = TRUE)
#> [1] 20 1
Created on 2019-11-06 by the reprex package (v0.3.0)
More info: https://github.com/moodymudskipper/unglue/blob/master/README.md
Upvotes: 6
Reputation: 388817
We can also use str_extract from stringr:
years<-c("20 years old", "1 years old")
as.integer(stringr::str_extract(years, "\\d+"))
#[1] 20 1
If there are multiple numbers in the string and we want to extract all of them, we may use str_extract_all, which, unlike str_extract, returns all the matches.
years<-c("20 years old and 21", "1 years old")
stringr::str_extract(years, "\\d+")
#[1] "20" "1"
stringr::str_extract_all(years, "\\d+")
#[[1]]
#[1] "20" "21"
#[[2]]
#[1] "1"
Upvotes: 31
Reputation: 886948
Update
Since extract_numeric is deprecated, we can use parse_number from the readr package.
library(readr)
parse_number(years)
Here is another option, with extract_numeric from tidyr:
library(tidyr)
extract_numeric(years)
#[1] 20 1
Upvotes: 126
Reputation: 337
Extract numbers that appear at the beginning of a string:
x <- gregexpr("^[0-9]+", years) # Numbers with any number of digits
x2 <- as.numeric(unlist(regmatches(years, x)))
Extract numbers from a string regardless of their position:
x <- gregexpr("[0-9]+", years) # Numbers with any number of digits
x2 <- as.numeric(unlist(regmatches(years, x)))
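Note that unlist() flattens the result, so it is no longer clear which number came from which string. If that matters, one possible variation (a sketch, using a made-up input with two numbers in the first element) keeps one numeric vector per input string:

```r
# Hypothetical input: the first element contains two numbers
years <- c("20 years old and 21", "1 years old")

# regmatches() returns a list with one character vector per input element;
# convert each to numeric without flattening
nums <- lapply(regmatches(years, gregexpr("[0-9]+", years)), as.numeric)
```

Here `nums` is a list of the same length as `years`.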
Upvotes: 8
Reputation: 8601
A stringr pipelined solution:
library(stringr)
years %>% str_match_all("[0-9]+") %>% unlist %>% as.numeric
Upvotes: 26
Reputation: 438
Based on a post by Gabor Grothendieck on the r-help mailing list:
years<-c("20 years old", "1 years old")
library(gsubfn)
pat <- "[-+.e0-9]*\\d"
sapply(years, function(x) strapply(x, pat, as.numeric)[[1]])
Upvotes: 5
Reputation: 15395
I think that substitution is an indirect way of getting to the solution. If you want to retrieve all the numbers, I recommend gregexpr:
matches <- regmatches(years, gregexpr("[[:digit:]]+", years))
as.numeric(unlist(matches))
If you have multiple matches in a string, this will get all of them. If you're only interested in the first match, use regexpr instead of gregexpr and you can skip the unlist.
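A minimal sketch of the first-match variant, using a made-up input in which the first string contains two numbers:

```r
# Hypothetical input: the first element contains two numbers
years <- c("20 years old and 21", "1 years old")

# regexpr() finds only the first match in each string, so regmatches()
# returns a plain character vector rather than a list
first <- regmatches(years, regexpr("[[:digit:]]+", years))
first_num <- as.numeric(first)
```

Be aware that regmatches() drops elements with no match when used with regexpr(), so the result can be shorter than the input if some strings contain no digits.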
Upvotes: 75
Reputation: 38619
Here's an alternative to Arun's first solution, with a simpler Perl-like regular expression:
as.numeric(gsub("[^\\d]+", "", years, perl=TRUE))
Upvotes: 41
Reputation: 109844
You could get rid of all the letters too:
as.numeric(gsub("[[:alpha:]]", "", years))
Likely this is less generalizable though.
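To illustrate why, here is a small sketch with a hypothetical input that ends in punctuation. Removing only letters leaves the period behind, and the conversion fails:

```r
# Hypothetical input with trailing punctuation
x <- "20 years old."

# Only letters are removed; spaces and the period remain
leftover <- gsub("[[:alpha:]]", "", x)

# as.numeric() cannot parse the leftover text and returns NA
# (suppressWarnings() hides the coercion warning)
result <- suppressWarnings(as.numeric(leftover))
```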
Upvotes: 18
Reputation: 118779
How about
# capture the first run of digits and drop everything after it
as.numeric(gsub("([0-9]+).*$", "\\1", years))
or
# simply remove the literal text " years old"
as.numeric(gsub(" years old", "", years))
or
# split by space, get the element in first index
as.numeric(sapply(strsplit(years, " "), "[[", 1))
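The first pattern works here because every string starts with the number. If text may precede the number, that leading text is kept. One possible adjustment (a sketch with a made-up input, using perl = TRUE for the \\D and \\d classes) also skips leading non-digits:

```r
# Hypothetical input: the first string does not start with the number
x <- c("aged 20 years", "1 years old")

# Original pattern: text before the digits survives
unanchored <- gsub("([0-9]+).*$", "\\1", x)

# Skip leading non-digits, capture the first run of digits, drop the rest
anchored <- sub("^\\D*(\\d+).*$", "\\1", x, perl = TRUE)
ages <- as.numeric(anchored)
```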
Upvotes: 130