Mohamed AL ANI

Reputation: 2062

Regex: get part of text from URL data

I have many URLs of this type:

http://www.example.com/some-text-to-get/jkl/another-text-to-get

I want to be able to get this:

["some-text-to-get", "another-text-to-get"]

I tried this:

re.findall(".*([[a-z]*-[a-z]*]*).*", "http://www.example.com/some-text-to-get/jkl/another-text-to-get")

but it's not working. Any ideas?

Upvotes: 1

Views: 54

Answers (4)

The fourth bird

Reputation: 163362

You could capture the two parts in two capturing groups:

^https?://[^/]+/([^/]+).*/(.*)$

That would match:

  • ^ Match from the start of the string
  • https?:// Match http with an optional s followed by ://
  • [^/]+/ Match one or more characters other than a forward slash (negated character class), followed by a forward slash
  • ([^/]+) Capture in a group (group 1) one or more characters other than a forward slash
  • .* Match any character zero or more times
  • / Match literally (this is the last slash because the .* is greedy)
  • (.*)$ Capture in a group (group 2) any character zero or more times, then assert the end of the line with $

Your matches are in the first and second capturing group.
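
As a minimal sketch of how this pattern might be applied from Python (the names pattern, url, m and parts are only illustrative and not part of the question), re.match can run it and the two groups can be read back as a list:

```python
import re

# Pattern from the answer above: group 1 is the first path segment,
# group 2 is everything after the last slash.
pattern = r"^https?://[^/]+/([^/]+).*/(.*)$"
url = "http://www.example.com/some-text-to-get/jkl/another-text-to-get"

m = re.match(pattern, url)
# m is None when the pattern does not match, for example when the URL
# has only one path segment, so fall back to an empty list in that case.
parts = list(m.groups()) if m else []
print(parts)
```

re.match is used rather than re.findall here because the pattern is anchored and is meant to match the whole URL once.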


Or you could parse the URL, get the path, split it on / and get your parts by index:

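from urllib.parse import urlparse  # Python 3; in Python 2: from urlparse import urlparse

o = urlparse('http://www.example.com/some-text-to-get/jkl/another-text-to-get')
# Drop the empty string that the leading slash produces; a list (not filter) allows indexing
parts = [p for p in o.path.split('/') if p]
print(parts[0])  # some-text-to-get
print(parts[2])  # another-text-to-get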

Or if you want to get the parts that contain a - you could use:

# Keep only the path segments that contain a hyphen
parts = [p for p in o.path.split('/') if '-' in p]
print(parts)


Upvotes: 2

leamon

Reputation: 41

You could capture it using this regular expression:

((?:[a-z]+-)+[a-z]+)

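  • [a-z]+ matches one or more lowercase letters

  • (?:[a-z]+-) is a non-capturing group that matches a run of letters followed by a hyphen; the + after it repeats that group one or more times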
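
For illustration, a short sketch of this pattern with re.findall, assuming the URL from the question (the names url and hyphenated are made up for the example):

```python
import re

url = "http://www.example.com/some-text-to-get/jkl/another-text-to-get"

# With a single capturing group, findall returns the text of that group
# for every non-overlapping match.
hyphenated = re.findall(r"((?:[a-z]+-)+[a-z]+)", url)
print(hyphenated)
```

Note that the pattern matches any run of lowercase letters joined by hyphens, wherever it occurs, so a hyphen in the host name would be picked up as well.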

Upvotes: 0

dawg

Reputation: 103864

Given:

>>> s
"http://www.example.com/some-text-to-get/jkl/another-text-to-get"

You can use this regex:

>>> re.findall(r"/([a-z-]+)(?:/|$)", s)
['some-text-to-get', 'another-text-to-get']

Of course you can do this with Python string methods and a list comprehension:

>>> [e for e in s.split('/') if '-' in e]
['some-text-to-get', 'another-text-to-get']

Upvotes: 0

Ajax1234

Reputation: 71451

You can use a lookbehind and lookahead:

import re
s = 'http://www.example.com/some-text-to-get/jkl/another-text-to-get'
final_result = re.findall(r'(?<=\.\w{3}/)[a-z\-]+|[a-z\-]+(?=$)', s)  # raw string avoids invalid-escape warnings

Output:

['some-text-to-get', 'another-text-to-get']

Upvotes: 1
