Reputation: 2907
I am trying to get a URL out of a long string and I am unsure how to write the regex:
$ string = '192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'
I am trying to use the 're.search' function to pull out only WWW.WEBSITE.COM, without the surrounding spaces. I would like it to look like this:
$ get_site = re.search(regex, string).group()
$ print get_site
$ WWW.WEBSITE.COM
Upvotes: 1
Views: 811
Reputation: 174696
You could also use this regex.
>>> import re
>>> string = '192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'
>>> match = re.search(r'-\s+([^ ]+)\s+GET', string)
>>> match.group(1)
'WWW.WEBSITE.COM'
Breakdown of regex:
- # a literal -
\s+ # one or more spaces
([^ ]+) # one or more non-space characters; the parentheses capture them into group 1
\s+ # one or more spaces
GET # all of the above must be followed by the literal string GET
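One thing the session above does not show: if a line lacks the `- ... GET` shape, re.search returns None and calling .group(1) on it raises AttributeError. A minimal sketch of a guarded version, assuming you want None back in that case (the names HOST_RE and extract_host are made up for illustration):

```python
import re

# Same pattern as in the answer, compiled once. HOST_RE is an illustrative name.
HOST_RE = re.compile(r'-\s+([^ ]+)\s+GET')

def extract_host(line):
    """Return the token between '-' and 'GET', or None if the line does not match."""
    match = HOST_RE.search(line)
    return match.group(1) if match else None

print(extract_host('192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'))
print(extract_host('no host on this line'))
```

The second call prints None instead of raising, so the caller can decide how to handle lines that do not fit.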
Upvotes: 0
Reputation: 1015
I wrote the following regex a while ago for a PHP project. It's based on the dedicated RFC, so it should cover any valid URL. I remember testing it extensively too, so it should be reliable.
const re_host = '(([a-z0-9-]+\.)+[a-z]+|([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])){3})';
const re_port = '(:[0-9]+)?';
const re_path = '([a-z0-9-\._\~\(\)]|%[0-9a-f]{2})+';
const re_query = '(\?(([a-z0-9-\._\~!\$&\'\(\)\*\+,;=:@/\?]|%[0-9a-f]{2})*)?)?';
const re_frag = '(#(([a-z0-9-\._\~!\$&\'\(\)\*\+,;=:@/\?]|%[0-9a-f]{2})*)?)?';
const re_localpart = '[a-z0-9!#\$%&\'*\+-/=\?\^_`{|}\~\.]+';
const re_GraphicFileExts = '\.(png|gif|jpg|jpeg)';
$this->re_href = '~^'.'('.'https?://'.self::re_host.self::re_port.'|)'.'((/'.self::re_path.')*|/?)'.'/?'.self::re_query.self::re_frag.'$~i';
Upvotes: 0
Reputation:
BUT they will all be in between a (-) and the (GET)
That is all the information you need:
>>> import re
>>> string = '192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'
>>> re.search(r'-\s+(.+?)\s+GET', string).group(1)
'WWW.WEBSITE.COM'
>>>
Below is a breakdown of what the Regex pattern is matching:
- # -
\s+ # One or more spaces
(.+?) # A capture group for one or more characters, as few as possible
\s+ # One or more spaces
GET # GET
Note too that .group(1) gets the text captured by (.+?). .group() would return the entire match:
>>> re.search(r'-\s+(.+?)\s+GET', string).group()
'- WWW.WEBSITE.COM GET'
>>>
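Since every line is said to have the host between the - and the GET, the same pattern can be applied line by line. A rough sketch, assuming the input is a multi-line string; only the first sample line comes from the question, the other two are invented for illustration:

```python
import re

# Hypothetical sample data: the first line is from the question,
# the second and third are made up to show a match and a non-match.
log = """192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link
192.00.00.01 - WWW.EXAMPLE.ORG GET /another/path
malformed line without the expected markers"""

pattern = re.compile(r'-\s+(.+?)\s+GET')

hosts = []
for line in log.splitlines():
    match = pattern.search(line)
    if match:  # skip lines that do not contain "- <host> GET"
        hosts.append(match.group(1))

print(hosts)
```

Lines that do not match are skipped rather than raising an error.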
Upvotes: 7
Reputation: 720
WWW\.(.+)\.[A-Z]{2,3}
WWW #WWW
\. #dot
(.+) #one or more arbitrary characters
\. #dot, again
[A-Z]{2,3} #two or three uppercase letters (some domains, such as .eu, have only two)
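For reference, this is how the pattern behaves with re.search on the string from the question. It is only a sketch: the second input is made up, to show that a TLD longer than three letters gets cut off by {2,3}.

```python
import re

string = '192.00.00.00 - WWW.WEBSITE.COM GET /random/url/link'
match = re.search(r'WWW\.(.+)\.[A-Z]{2,3}', string)

whole = match.group()    # the entire match, which is what the question asks for
inner = match.group(1)   # only the part captured by (.+)
print(whole)
print(inner)

# Made-up input: a four-letter TLD is matched only up to its third letter.
truncated = re.search(r'WWW\.(.+)\.[A-Z]{2,3}', 'WWW.EXAMPLE.INFO GET /').group()
print(truncated)
```

Here .group() gives WWW.WEBSITE.COM and .group(1) gives WEBSITE. The pattern is also case-sensitive as written, so lowercase hosts would need re.IGNORECASE.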
Upvotes: 0