Reputation: 2062
I have many URLs of this form:
http://www.example.com/some-text-to-get/jkl/another-text-to-get
I want to extract this:
["some-text-to-get", "another-text-to-get"]
I tried this:
re.findall(".*([[a-z]*-[a-z]*]*).*", "http://www.example.com/some-text-to-get/jkl/another-text-to-get")
but it doesn't work. Any ideas?
Upvotes: 1
Views: 54
Reputation: 163362
You could capture the two parts in two capturing groups:
^https?://[^/]+/([^/]+).*/(.*)$
That would match:
^
Match from the start of the string
https?://
Match http with an optional s followed by ://
[^/]+/
Match not a forward slash one or more times using a negated character class, followed by a forward slash
([^/]+)
Capture in a group (group 1) not a forward slash, one or more times
.*
Match any character zero or more times
/
Match a forward slash literally (this is the last slash because the preceding .* is greedy)
(.*)$
Match in a capturing group (group 2) any character zero or more times and assert the end of the line with $
Your matches are in the first and second capturing group.
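As a minimal sketch of how that pattern could be used from Python (the names pattern, url, match and parts are only illustrative):

```python
import re

# The pattern from above, compiled once; variable names are illustrative.
pattern = re.compile(r"^https?://[^/]+/([^/]+).*/(.*)$")
url = "http://www.example.com/some-text-to-get/jkl/another-text-to-get"

match = pattern.match(url)
# match is None when the pattern does not apply,
# e.g. for a URL with only one path segment.
parts = [match.group(1), match.group(2)] if match else []
print(parts)  # ['some-text-to-get', 'another-text-to-get']
```

Note that group 2 is simply everything after the last slash, so a URL with a trailing slash would give an empty string there.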
Or you could parse the URL, take the path, split it on / and get your parts by index:
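from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse
o = urlparse('http://www.example.com/some-text-to-get/jkl/another-text-to-get')
# Build a real list (not a lazy filter object) so it can be indexed; empty segments are dropped
parts = [p for p in o.path.split('/') if p]
print(parts[0])
print(parts[2])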
Or, if you want the parts that contain a -, you could use:
# A list comprehension prints as a list on both Python 2 and Python 3
parts = [p for p in o.path.split('/') if '-' in p]
print(parts)
Upvotes: 2
Reputation: 41
You could capture it using this regular expression:
((?:[a-z]+-)+[a-z]+)
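[a-z]+
Match one or more lowercase letters
(?:[a-z]+-)
A non-capturing group: one or more lowercase letters followed by a hyphen. The + after the group repeats it one or more times.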
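A short sketch of applying this with re.findall, assuming the example URL from the question (the names url and slugs are only illustrative):

```python
import re

url = "http://www.example.com/some-text-to-get/jkl/another-text-to-get"

# With exactly one capturing group, findall returns that group's text
# for every non-overlapping match.
slugs = re.findall(r"((?:[a-z]+-)+[a-z]+)", url)
print(slugs)  # ['some-text-to-get', 'another-text-to-get']
```

Because the pattern is not anchored to the path, it would also pick up a hyphenated hostname such as my-site.example.com, so it may be worth running it on the path only.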
Upvotes: 0
Reputation: 103864
Given:
>>> s
"http://www.example.com/some-text-to-get/jkl/another-text-to-get"
You can use this regex:
>>> re.findall(r"/([a-z-]+)(?:/|$)", s)
['some-text-to-get', 'another-text-to-get']
Of course you can do this with Python string methods and a list comprehension:
>>> [e for e in s.split('/') if '-' in e]
['some-text-to-get', 'another-text-to-get']
Upvotes: 0
Reputation: 71451
You can use a lookbehind and lookahead:
import re
s = 'http://www.example.com/some-text-to-get/jkl/another-text-to-get'
final_result = re.findall(r'(?<=\.\w{3}/)[a-z\-]+|[a-z\-]+(?=$)', s)
Output:
['some-text-to-get', 'another-text-to-get']
Upvotes: 1