Reputation: 373
I get an error when I try to parse "bloomberg" out of self.web_url. The type of self.web_url is unicode, so I assume that might be the cause. However, I do not know how to convert the type, if that is even necessary, or what else to do.
self.web_url = "http://www.bloomberg.com"
start = "http:/www."
end = ".com"
print type(self.web_url)
web_name = re.search('%s(.*)%s' % (start, end), self.web_url).group(1)
Upvotes: 1
Views: 1244
Reputation: 626758
You get the error because there is no match. Your pattern is incorrect since it matches a single `/`, while there are 2 `/`s after `http:`. You need to fix the pattern as heemayl suggests, or use an alternative `urlparse`-based solution to get the `netloc` part and then take the part between the first and last dots (either with `find` and `rfind`, or with a regex):
import urlparse, re
path = urlparse.urlparse("http://www.bloomberg.com")
print(path.netloc[path.netloc.find(".")+1:path.netloc.rfind(".")]) # => bloomberg
# or a regex:
print(re.sub(r"\A[^.]*\.(.*)\.[^.]*\Z", r"\1", path.netloc)) # => bloomberg
# or Regex 2:
mObj = re.search(r"\.(.*)\.", path.netloc)
if mObj:
    print(mObj.group(1)) # => bloomberg
Regex 1 - `\A[^.]*\.(.*)\.[^.]*\Z` - will match the start of the string (`\A`), then 0+ non-`.` chars (`[^.]*`), then a dot (`\.`), then will capture any 0+ chars other than a newline into Group 1, then will match a `.` and 0+ non-`.` chars up to the very end of the string (`\Z`).
Regex 2 will just match the first `.` followed by any 0+ chars up to the last `.`, capturing what is between the `.`s into Group 1.
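For reference, the `find`/`rfind` variant above can be wrapped in a small helper. This is only a sketch under a few assumptions: it targets Python 3, where `urlparse` lives in `urllib.parse` (the snippet above uses the Python 2 `urlparse` module), and the function name `name_between_dots` is made up for illustration.

```python
from urllib.parse import urlparse  # Python 3 location; Python 2 uses "import urlparse"


def name_between_dots(url):
    """Return the text between the first and last dot of the host, or None."""
    host = urlparse(url).netloc
    first, last = host.find("."), host.rfind(".")
    # Fewer than two dots means there is nothing "in between" to return.
    if first == -1 or first == last:
        return None
    return host[first + 1:last]


print(name_between_dots("http://www.bloomberg.com"))  # bloomberg
print(name_between_dots("http://localhost"))          # None
```

Note that this takes everything between the first and last dots, so a host such as `www.bbc.co.uk` would give `bbc.co` rather than `bbc`.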
Upvotes: 1
Reputation: 42007
You are missing a `/` in `start`:
start = 'http://www.'
Also note that the `.` has a special meaning in regex: it is a token that matches any single character (other than a newline, by default), not a literal `.`. You need to escape it to make it literal, i.e. `\.`.
So you had better use:
start = r"http://www\."
end = r"\.com"
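If the delimiters are meant to be plain text, a possible alternative to escaping them by hand is `re.escape`, combined with a check for a failed match before calling `.group(1)`. The following is a minimal sketch, not the asker's code: the helper name `extract_between` is hypothetical.

```python
import re


def extract_between(text, start, end):
    """Return what lies between the literal start and end markers, or None."""
    # re.escape backslash-escapes regex metacharacters such as "." in the markers.
    pattern = re.escape(start) + "(.*)" + re.escape(end)
    match = re.search(pattern, text)
    # re.search returns None when nothing matches, so guard before .group().
    return match.group(1) if match else None


print(extract_between("http://www.bloomberg.com", "http://www.", ".com"))  # bloomberg
print(extract_between("http://www.bloomberg.com", "http:/www.", ".com"))   # None
```

The second call uses the original `start` with a single `/`; it finds no match and returns `None` instead of raising an error.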
Upvotes: 1