Reputation: 1893
I have a URL like
https://endpoint/v1.0/album/id/photo/id/
where endpoint is a variable. I want to extract "/v1.0/album/id/photo/id/".
How do I extract everything after "endpoint" using a Ruby regular expression?
Upvotes: 3
Views: 1950
Reputation: 160553
The URI RFC documents the pattern used to parse a URL:
Appendix B. Parsing a URI Reference with a Regular Expression
As the "first-match-wins" algorithm is identical to the "greedy"
disambiguation method used by POSIX regular expressions, it is
natural and commonplace to use a regular expression for parsing the
potential five components of a URI reference.
The following line is the regular expression for breaking-down a
well-formed URI reference into its components.
Berners-Lee, et al. Standards Track [Page 50]
RFC 3986 URI Generic Syntax January 2005
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
The numbers in the second line above are only to assist readability;
they indicate the reference points for each subexpression (i.e., each
paired parenthesis). We refer to the value matched for subexpression
<n> as $<n>. For example, matching the above expression to
http://www.ics.uci.edu/pub/ietf/uri/#Related
results in the following subexpression matches:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
where <undefined> indicates that the component is not present, as is
the case for the query component in the above example. Therefore, we
can determine the value of the five components as
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
Based on that:
URL_REGEX = %r!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!
'https://endpoint/v1.0/album/id/photo/id/'.match(URL_REGEX).captures
# => ["https:",
# "https",
# "//endpoint",
# "endpoint",
# "/v1.0/album/id/photo/id/",
# nil,
# nil,
# nil,
# nil]
'https://endpoint/v1.0/album/id/photo/id/'.match(URL_REGEX).captures[4]
# => "/v1.0/album/id/photo/id/"
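The RFC's mapping from numbered groups to components (scheme = $2, authority = $4, path = $5, query = $7, fragment = $9) can be wrapped in a small helper. This is only a sketch: split_uri is a hypothetical name of my own, and it assumes the input is a well-formed URI reference, as the RFC text does.

```ruby
# Hypothetical helper (not part of any library): applies the RFC 3986
# Appendix B pattern and labels the five components by group number.
def split_uri(str)
  m = %r!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!.match(str)
  {
    scheme:    m[2],
    authority: m[4],
    path:      m[5],
    query:     m[7],
    fragment:  m[9]
  }
end

parts = split_uri('https://endpoint/v1.0/album/id/photo/id/')
parts[:authority] # => "endpoint"
parts[:path]      # => "/v1.0/album/id/photo/id/"
```

Components that are absent, such as the query and fragment here, come back as nil.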
Upvotes: 0
Reputation: 74620
The full regex solution is what the URI library does in the background. Doing it on your own is largely an exercise in futility...
In any case, here is a simple regex using named capture groups, (?<name>), and the /x flag on the end to allow whitespace and comments in the pattern.
url = 'https://endpoint/v1.0/album/id/photo/id/'
re = /
  ^                               # start of line
  (?<scheme> https? )             # http or https
  :\/\/                           # separator
  (?<domain> [a-zA-Z0-9.-]+? )    # alphanumerics, hyphens or dots
  (?<path> \/.+ )                 # everything from the first slash on is the path
/x
res = url.match re
res[:path] if res
# => "/v1.0/album/id/photo/id/"
This pales in comparison to the URI library.
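To illustrate where a hand-rolled pattern falls short, here is a rough comparison on a URL with a port number. The port and the variable names are my own additions for the example. The simple pattern's domain class does not allow a colon, so it fails to match, while URI copes.

```ruby
require 'uri'

# Same idea as the simple pattern above, written on one line.
simple = /\A(?<scheme>https?):\/\/(?<domain>[a-zA-Z0-9.-]+?)(?<path>\/.+)/

# Illustrative URL with a port (an assumption, not from the question).
url = 'https://endpoint:8080/v1.0/album/id/photo/id/'

simple_match = simple.match(url) # => nil, because ":" is not in the domain class
uri_path     = URI(url).path     # => "/v1.0/album/id/photo/id/"
```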
Upvotes: 1
Reputation: 1
Here's a simple solution based on string substitution:
domain = 'endpoint'
link = "https://#{domain}/v1.0/album/id/photo/id/"
path = link.gsub("https://#{domain}", '')
# => "/v1.0/album/id/photo/id/"
You can adjust the domain name by changing the domain variable. I used String#gsub to replace the first portion of your link with an empty string, so the path is the only part of the string that remains. The pattern on line 3 is surprisingly simple: it is just the literal string "https://endpoint" rather than a true regular expression.
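If an actual regular expression is wanted, one possible variant anchors the pattern at the start of the string and escapes the variable part with Regexp.escape, then removes only that one prefix with String#sub. The method name path_after is hypothetical, and accepting both http and https is my own assumption.

```ruby
# Hypothetical helper: strips "http(s)://<domain>" from the front of link.
# Regexp.escape keeps dots or other special characters in the domain literal.
def path_after(link, domain)
  prefix = %r{\Ahttps?://#{Regexp.escape(domain)}}
  link.sub(prefix, '')
end

domain = 'endpoint'
link = "https://#{domain}/v1.0/album/id/photo/id/"
path = path_after(link, domain)
# => "/v1.0/album/id/photo/id/"
```

Because the pattern is anchored with \A, a later occurrence of the same text inside the path is left alone.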
Upvotes: 0
Reputation: 118261
Here we go:
2.0.0-p451 :001 > require 'uri'
=> true
2.0.0-p451 :002 > URI('https://endpoint/v1.0/album/id/photo/id/').path
=> "/v1.0/album/id/photo/id/"
See the "Basic example" section of the URI documentation.
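As a small sketch of how this might be used when the input is not guaranteed to be valid, the call can be wrapped so that a malformed URL yields nil instead of raising. The method name path_of is hypothetical. URI.parse and URI::InvalidURIError are part of the standard library.

```ruby
require 'uri'

# Hypothetical wrapper: returns the path component, or nil if the
# string cannot be parsed as a URI.
def path_of(url)
  URI.parse(url).path
rescue URI::InvalidURIError
  nil
end

path_of('https://endpoint/v1.0/album/id/photo/id/')
# => "/v1.0/album/id/photo/id/"
```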
Upvotes: 5