Reputation: 1437
I have a string vector. I would like to extract the number after "# of Stalls: ". The numbers are located either in the middle or at the end of the string.
x <- c("1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/># of Stalls: 244<br/>Cost: Free", "20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/># of Stalls: 40")
Here is my attempt, but it is not sufficient: it leaves everything after the number in place. I appreciate your help.
gsub(".*\\# of Stalls: ", "", x)
Upvotes: 3
Views: 2999
Reputation: 269431
Here are some solutions. (1) and (1a) are variations of the code in the question. (2) and (2a) take the opposite approach: instead of removing what we don't want, they match what we do want.
1) gsub The code in the question removes the portion before the number but does not remove the portion after it. We can modify it to do both at once, as below. The |\\D.*$ part that we added does that. Note that "\\D" matches any non-digit.
as.integer(gsub(".*# of Stalls: |\\D.*$", "", xx))
## [1] 244 40
1a) sub Alternatively, do these in two separate sub calls. The inner sub is from the question, and the outer sub removes everything from the first non-digit after the number onwards.
as.integer(sub("\\D.*$", "", sub(".*# of Stalls: ", "", xx)))
## [1] 244 40
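To make the two removal steps concrete, here is a minimal sketch on a single short string. The string s is a hypothetical input shaped like the ones in the question, and the variable names are only for illustration.

```r
# Hypothetical short input, shaped like the strings in the question
s <- "Main St<br/># of Stalls: 244<br/>Cost: Free"

# Step 1: drop everything up to and including the label
after_label <- sub(".*# of Stalls: ", "", s)
after_label
## [1] "244<br/>Cost: Free"

# Step 2: drop everything from the first non-digit onwards
digits <- sub("\\D.*$", "", after_label)
digits
## [1] "244"

# Both removals in a single gsub call, using alternation
one_pass <- gsub(".*# of Stalls: |\\D.*$", "", s)
identical(one_pass, digits)
## [1] TRUE
```

Both variants rely on the label being present; a string without it would be trimmed by the "\\D.*$" part alone.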
2) strcapture With this approach, available in R 3.4.0 and later (the development version when this was written), we can simplify the regular expression substantially. We specify a pattern with a capture group (the portion in parentheses). strcapture returns the portion corresponding to the capture group and creates a data.frame from it. The third argument is a prototype structure that tells it to return integers. Note that "\\d" matches any digit.
strcapture("# of Stalls: (\\d+)", xx, list(stalls = integer()))
## stalls
## 1 244
## 2 40
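The same capture-group idea can also be sketched with regexec and regmatches from base R, which may be handy where strcapture is unavailable or where some strings might lack the label. The helper name extract_stalls and the sample strings are assumptions made for illustration, not part of the question.

```r
# Hypothetical helper: returns the integer after "# of Stalls: ",
# or NA when the label does not occur in the string
extract_stalls <- function(s) {
  # regexec gives match positions for the full match and the capture group;
  # regmatches turns them into character vectors (length 0 if no match)
  hits <- regmatches(s, regexec("# of Stalls: (\\d+)", s))
  captured <- vapply(
    hits,
    function(hit) if (length(hit) == 2L) hit[2L] else NA_character_,
    character(1)
  )
  as.integer(captured)
}

# Hypothetical inputs: one with the label, one without
extract_stalls(c("Main St<br/># of Stalls: 244<br/>Cost: Free", "12 Main St"))
## [1] 244  NA
```

Returning NA for strings without the label keeps an unrelated number, such as the street number in the second string, from being mistaken for a stall count.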
2a) strapply The strapply function in the gsubfn package is similar to strcapture but uses an apply paradigm where the first argument is the input string, the second is the pattern and the third is the function to apply to the capture group.
library(gsubfn)
strapply(xx, "# of Stalls: (\\d+)", as.integer, simplify = TRUE)
## [1] 244 40
Note: The input xx used is the same as x in the question:
xx <- c("1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/># of Stalls: 244<br/>Cost: Free",
"20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/># of Stalls: 40"
)
Upvotes: 4
Reputation: 43334
Since it's HTML, you can use rvest or another HTML parser to extract the nodes you want first, which makes extracting the numbers trivial. XPath selectors and functions afford a little more flexibility than CSS ones for this sort of work.
library(rvest)
x %>% paste(collapse = '<br/>') %>%
read_html() %>%
html_nodes(xpath = '//text()[contains(., "# of Stalls:")]') %>%
html_text() %>%
readr::parse_number()
#> [1] 244 40
Upvotes: 7
Reputation: 11128
There are many ways to solve this problem; I am going to use the stringr package. The inner str_extract fetches the values
[1] "# of Stalls: 244" "# of Stalls: 40"
and the outer str_extract then extracts only the digits from each of those strings.
I am, however, not clear whether you want to extract the number or replace the matched text with it. If you want to extract it, the code below works:
library(stringr)
as.integer(str_extract(str_extract(x,"#\\D*\\d{1,}"),"\\d{1,}"))
If you want to replace the matched text with the number instead, use str_replace:
str_replace(x,"#\\D*(\\d{1,})","\\1")
Output:
Output for extract:
> as.integer(str_extract(str_extract(x,"#\\D*\\d{1,}"),"\\d{1,}"))
[1] 244 40
Output for replace:
> str_replace(x,"#\\D*(\\d{1,})","\\1")
[1] "1345 W. Pacific Coast Highway<br/>Wilmington 90710<br/><br/>County: Los Angeles<br/>Date Updated: 6/25/2013<br/>Latitude:-118.28079400<br/>Longitude:33.79077900<br/>244<br/>Cost: Free"
[2] "20601 La Puente Ave<br/>Walnut 91789<br/>County: Los Angeles<br/>Date Updated: 6/18/2007<br/>Latitude: -117.859972<br/>Longitude: 34.017513<br/>Owner: Church<br/>Operator: Caltrans<br/>40"
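As a possible variation, the nested str_extract calls could be collapsed into one call with a capture group via str_match, which returns a character matrix whose second column holds the first group. This is a sketch under the assumption that the label text is exactly "# of Stalls: "; the input s is a hypothetical shortened version of x.

```r
library(stringr)

# Hypothetical shortened inputs, shaped like x in the question
s <- c("Main St<br/># of Stalls: 244<br/>Cost: Free",
       "Elm St<br/># of Stalls: 40")

# Column 1 is the full match, column 2 is the captured digits
m <- str_match(s, "# of Stalls: (\\d+)")
stalls <- as.integer(m[, 2])
stalls
## [1] 244  40
```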
Upvotes: 4
Reputation: 886938
We match one or more characters that are not a # ([^#]+) from the start (^) of the string, followed by a #, followed by zero or more characters that are not digits ([^0-9]*), followed by one or more digits ([0-9]+) captured as a group ((...)), followed by any remaining characters (.*), and replace the whole match with the backreference (\\1) to the captured group.
as.integer(sub("^[^#]+#[^0-9]*([0-9]+).*", "\\1", x))
#[1] 244 40
If the label is known exactly, we can spell it out in the pattern:
as.integer(sub("^[^#]+# of Stalls:\\s+([0-9]+).*", "\\1", x))
#[1] 244 40
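Both patterns above assume that the label contains the first # in the string. The sketch below uses a hypothetical input with an earlier # (not present in the question's data) to show what happens in that case, and one possible adjustment that anchors on the label with a leading .* instead of ^[^#]+.

```r
# Hypothetical input with a '#' before the label
s <- "Suite #5<br/># of Stalls: 12<br/>Cost: Free"

# Generic pattern: picks up the digits after the first '#'
generic <- sub("^[^#]+#[^0-9]*([0-9]+).*", "\\1", s)
generic
## [1] "5"

# Specific pattern from above: no match here, so sub returns s unchanged
specific <- sub("^[^#]+# of Stalls:\\s+([0-9]+).*", "\\1", s)
identical(specific, s)
## [1] TRUE

# Assumed adjustment: let .* skip any earlier text, including other '#'
anchored <- sub(".*# of Stalls:\\s*([0-9]+).*", "\\1", s)
as.integer(anchored)
## [1] 12
```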
Upvotes: 6