POTENZA

Reputation: 1437

Extract a number in the middle or at the end of a string in R

I have a string vector. I would like to extract the number that follows "# of Stalls: ". The number is located either in the middle or at the end of each string.

x <- c("1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/># of Stalls: 244<br/>Cost: Free", "20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/># of Stalls: 40")

Here is my attempt, but it is not sufficient. I would appreciate your help.

gsub(".*\\# of Stalls: ", "", x) 

Upvotes: 3

Views: 2999

Answers (4)

G. Grothendieck

Reputation: 269431

Here are some solutions. (1) and (1a) are variations of the code in the question. (2) and (2a) take the opposite approach: instead of removing what we don't want, they match what we do want.

1) gsub The code in the question removes the portion before the number but does not remove the portion after it. We can modify it to do both at once, as shown below. The |\\D.*$ part that we added removes everything from the first non-digit after the number onwards. Note that "\\D" matches any non-digit.

as.integer(gsub(".*# of Stalls: |\\D.*$", "", xx))
## [1] 244  40

1a) sub Alternatively, do this in two separate sub calls. The inner sub is from the question, and the outer sub removes everything from the first non-digit after the number onwards.

as.integer(sub("\\D.*$", "", sub(".*# of Stalls: ", "", xx)))
## [1] 244  40
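
To make the two steps in (1a) easier to follow, here is a minimal sketch that keeps the intermediate result. The vector s is a shortened, hypothetical stand-in for the input, not the original data.

```r
# Shortened, hypothetical stand-ins for the two kinds of input strings
s <- c("90710<br/># of Stalls: 244<br/>Cost: Free",   # number in the middle
       "Operator: Caltrans<br/># of Stalls: 40")      # number at the end

# Inner step: drop everything up to and including the label
inner <- sub(".*# of Stalls: ", "", s)
inner
## [1] "244<br/>Cost: Free" "40"

# Outer step: drop everything from the first non-digit onwards
outer <- sub("\\D.*$", "", inner)
as.integer(outer)
## [1] 244  40
```

When the number is already at the end of the string, the outer pattern finds nothing to match and sub returns its input unchanged.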

2) strcapture With this approach, available in R 3.4.0 and later (it was in the development version of R when this answer was written), we can simplify the regular expression substantially. We specify a pattern with a capture group (the portion in parentheses). strcapture returns the portion corresponding to the capture group and creates a data.frame from it. The third argument is a prototype structure that tells it to return integers. Note that "\\d" matches any digit.

strcapture("# of Stalls: (\\d+)", xx, list(stalls = integer()))
##   stalls
## 1    244
## 2     40
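
The same idea extends to more than one capture group, with one prototype column per group. The following is only a sketch: the strings in s and the county column are hypothetical additions for illustration, not part of the question.

```r
# Hypothetical shortened inputs; the county values are made up for illustration
s <- c("90710<br/>County: Los Angeles<br/># of Stalls: 244<br/>Cost: Free",
       "91789<br/>County: Orange<br/>Owner: Church<br/># of Stalls: 40")

# One capture group per prototype column, in the same order
res <- strcapture("County: ([^<]+)<br/>.*# of Stalls: (\\d+)", s,
                  data.frame(county = character(), stalls = integer()))
res
##        county stalls
## 1 Los Angeles    244
## 2      Orange     40
```

Each column is converted to the type of the corresponding prototype column, so stalls comes back as an integer vector and county as character.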

2a) strapply The strapply function in the gsubfn package is similar to strcapture but uses an apply paradigm where the first argument is the input string, the second is the pattern and the third is the function to apply to the capture group.

library(gsubfn)

strapply(xx, "# of Stalls: (\\d+)", as.integer, simplify = TRUE)
## [1] 244  40

Note: The input xx used is the same as x in the question:

xx <- c("1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/># of Stalls: 244<br/>Cost: Free", 
"20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/># of Stalls: 40"
)

Upvotes: 4

alistaire

Reputation: 43334

Since it's HTML, you can use rvest or another HTML parser to extract the nodes you want first, which makes extracting the numbers trivial. XPath selectors and functions afford a little more flexibility than CSS ones for this sort of work.

library(rvest)

x %>% paste(collapse = '<br/>') %>% 
    read_html() %>% 
    html_nodes(xpath = '//text()[contains(., "# of Stalls:")]') %>% 
    html_text() %>% 
    readr::parse_number()
#> [1] 244  40

Upvotes: 7

PKumar

Reputation: 11128

There are many ways to solve this problem; I will use the stringr package. The inner str_extract returns "# of Stalls: 244" and "# of Stalls: 40", and the outer str_extract then pulls out only the digits from each.

I am not sure whether you want to extract the number or replace the matched text with it. To extract the number, the code below works. To replace the matched text, use str_replace instead.

library(stringr)
as.integer(str_extract(str_extract(x,"#\\D*\\d{1,}"),"\\d{1,}"))

To replace the matched text with the number instead:

str_replace(x,"#\\D*(\\d{1,})","\\1")

Output:

Output for extract:

> as.integer(str_extract(str_extract(x,"#\\D*\\d{1,}"),"\\d{1,}"))
[1] 244  40

Output for replace:

> str_replace(x,"#\\D*(\\d{1,})","\\1")
[1] "1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/>244<br/>Cost: Free"    
[2] "20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/>40"
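
As a possible variation on the extract approach, a single pattern with a lookbehind avoids the nested call. This sketch uses base R (regmatches and regexpr with perl = TRUE) rather than stringr, and the vector s is a shortened, hypothetical stand-in for x.

```r
# Shortened, hypothetical stand-ins for the elements of x
s <- c("90710<br/># of Stalls: 244<br/>Cost: Free",
       "Operator: Caltrans<br/># of Stalls: 40")

# (?<=...) is a lookbehind: the label must precede the digits
# but is not part of the match itself (requires perl = TRUE)
hits <- regmatches(s, regexpr("(?<=# of Stalls: )\\d+", s, perl = TRUE))
as.integer(hits)
## [1] 244  40
```

Note that regmatches drops elements without a match when used with regexpr, so the result can be shorter than the input if some strings lack the label.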

Upvotes: 4

akrun

Reputation: 886938

We match, from the start of the string (^), one or more characters that are not a # ([^#]+), then a #, then zero or more characters that are not digits ([^0-9]*), then one or more digits ([0-9]+) captured as a group ((...)), then any remaining characters (.*). The whole match is replaced with the backreference (\\1) to the captured group. Note that this relies on the first # in each string being the one that starts "# of Stalls".

as.integer(sub("^[^#]+#[^0-9]*([0-9]+).*", "\\1", x))
#[1] 244  40

If the label is known, we can match it explicitly:

as.integer(sub("^[^#]+# of Stalls:\\s+([0-9]+).*", "\\1", x))
#[1] 244  40
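
One caveat with sub is that it returns the input unchanged when the pattern does not match, so as.integer then yields NA with a coercion warning. A minimal sketch of an alternative that returns NA quietly, assuming some strings may lack the label (the vector s, including its third element, is hypothetical):

```r
# Hypothetical inputs; the third element has no "# of Stalls" field
s <- c("90710<br/># of Stalls: 244<br/>Cost: Free",
       "Operator: Caltrans<br/># of Stalls: 40",
       "91789<br/>Owner: Church")

# regexec records the full match and each capture group;
# regmatches returns character(0) for elements that did not match
parts <- regmatches(s, regexec("# of Stalls:\\s*([0-9]+)", s))

# Take the capture group when present, otherwise NA
stalls <- vapply(parts, function(p) {
  if (length(p) == 2L) as.integer(p[2]) else NA_integer_
}, integer(1))
stalls
## [1] 244  40  NA
```

This keeps the result the same length as the input, which matters if the numbers are to be stored next to the original strings.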

Upvotes: 6
