user1788294

Reputation: 1893

Ruby regular expression - Extracting part of URL

I have a URL like

https://endpoint/v1.0/album/id/photo/id/

where endpoint is a variable. I want to extract "/v1.0/album/id/photo/id/".

How do I extract everything after "endpoint" using a Ruby regular expression?

Upvotes: 3

Views: 1950

Answers (4)

the Tin Man

Reputation: 160553

The URI RFC (RFC 3986) documents, in Appendix B, a pattern for parsing a URL:

Appendix B.  Parsing a URI Reference with a Regular Expression

   As the "first-match-wins" algorithm is identical to the "greedy"
   disambiguation method used by POSIX regular expressions, it is
   natural and commonplace to use a regular expression for parsing the
   potential five components of a URI reference.

   The following line is the regular expression for breaking-down a
   well-formed URI reference into its components.



RFC 3986                   URI Generic Syntax               January 2005


      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
       12            3  4          5       6  7        8 9

   The numbers in the second line above are only to assist readability;
   they indicate the reference points for each subexpression (i.e., each
   paired parenthesis).  We refer to the value matched for subexpression
   <n> as $<n>.  For example, matching the above expression to

      http://www.ics.uci.edu/pub/ietf/uri/#Related

   results in the following subexpression matches:

      $1 = http:
      $2 = http
      $3 = //www.ics.uci.edu
      $4 = www.ics.uci.edu
      $5 = /pub/ietf/uri/
      $6 = <undefined>
      $7 = <undefined>
      $8 = #Related
      $9 = Related

   where <undefined> indicates that the component is not present, as is
   the case for the query component in the above example.  Therefore, we
   can determine the value of the five components as

      scheme    = $2
      authority = $4
      path      = $5
      query     = $7
      fragment  = $9

Based on that:

URL_REGEX = %r!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!
'https://endpoint/v1.0/album/id/photo/id/'.match(URL_REGEX).captures
# => ["https:",
#     "https",
#     "//endpoint",
#     "endpoint",
#     "/v1.0/album/id/photo/id/",
#     nil,
#     nil,
#     nil,
#     nil]

'https://endpoint/v1.0/album/id/photo/id/'.match(URL_REGEX).captures[4]
# => "/v1.0/album/id/photo/id/"
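Counting capture indices by hand is easy to get wrong, so the five components can be pulled out by name instead. This is only a minimal sketch built on the Appendix B pattern quoted above; the helper name split_uri and the sample URL with a query and fragment are assumptions for illustration:

```ruby
# Hypothetical helper: maps the RFC's $2, $4, $5, $7 and $9 to named keys.
def split_uri(url)
  # RFC 3986, Appendix B pattern. Every group is optional, so it always matches.
  pattern = %r!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!
  m = pattern.match(url)
  {
    scheme:    m[2],
    authority: m[4],
    path:      m[5],
    query:     m[7],
    fragment:  m[9]
  }
end

parts = split_uri('https://endpoint/v1.0/album/id/photo/id/?page=2#top')
parts[:path]   # => "/v1.0/album/id/photo/id/"
parts[:query]  # => "page=2"
```

Components that are absent from the URL come back as nil, matching the <undefined> entries in the RFC text.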

Upvotes: 0

Matt

Reputation: 74620

The full regex solution is what the URI library does in the background. Doing it on your own is largely an exercise in futility...

In any case, here is a simple regex using named capture groups (?<name>) and the /x flag on the end, which allows whitespace and comments inside the pattern.

url = 'https://endpoint/v1.0/album/id/photo/id/'

re = /
              ^                    # beginning of string
  (?<scheme>  https?             ) # http or https
              :\/\/                # separator
  (?<domain>  [a-zA-Z0-9.-]+?    ) # alphanumerics, hyphens, or dots
  (?<path>    \/.+               ) # everything from the first slash on is the path
/x

res = url.match re
res[:path] if res

This pales in comparison to the URI library.

Upvotes: 1

bluemill

Reputation: 1

Here's a solution that needs only a literal pattern:

domain = 'endpoint'
link = "https://#{domain}/v1.0/album/id/photo/id/"
path = link.gsub("https://#{domain}", '')
# => "/v1.0/album/id/photo/id/"

You can adjust the domain name by changing the "domain" variable. I used String#gsub to replace the first portion of your link with an empty string, which means the path is the only part of the string that will remain. The pattern on line 3 is surprisingly simple: it's literally "https://" followed by the domain. Because it is passed as a String rather than a Regexp, gsub matches it literally, and it removes every occurrence, not just the leading one.
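If an actual regular expression is wanted with a variable endpoint, one possible sketch is to interpolate the endpoint through Regexp.escape and anchor the match at the start of the string. The endpoint value below is an assumption chosen to show why escaping matters (an unescaped dot would match any character):

```ruby
endpoint = 'api.example.com'  # assumed value, not from the question
url = "https://#{endpoint}/v1.0/album/id/photo/id/"

# \A anchors at the start of the string; Regexp.escape makes the
# endpoint match literally, so its dots are not treated as wildcards.
pattern = %r{\Ahttps?://#{Regexp.escape(endpoint)}}

# sub replaces only the first (here: the anchored) match.
path = url.sub(pattern, '')
path  # => "/v1.0/album/id/photo/id/"
```

Using sub rather than gsub keeps the rest of the URL intact even if the same text happens to appear again later in the path.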

Upvotes: 0

Arup Rakshit

Reputation: 118261

Here we go:

2.0.0-p451 :001 > require 'uri'
 => true
2.0.0-p451 :002 > URI('https://endpoint/v1.0/album/id/photo/id/').path
 => "/v1.0/album/id/photo/id/"

Read the "Basic example" in the URI library documentation.
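The same parsed object exposes the other components too, which helps if the real URLs carry a query string. A small sketch, assuming a URL with a query parameter added for illustration (the question's URL has none):

```ruby
require 'uri'

# The "?page=2" query is an assumption added to show the difference
# between path and request_uri.
uri = URI.parse('https://endpoint/v1.0/album/id/photo/id/?page=2')

uri.host         # => "endpoint"
uri.path         # => "/v1.0/album/id/photo/id/"
uri.query        # => "page=2"
uri.request_uri  # => "/v1.0/album/id/photo/id/?page=2"
```

path drops the query string, while request_uri (available on HTTP and HTTPS URIs) keeps it, so pick whichever counts as "everything after endpoint" for your case.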

Upvotes: 5
