David542

Reputation: 110267

How to regex split, but keep the split string?

I have the following URL pattern:

http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en

I would like to get everything up until and inclusive of /watch/\d+/.

So far I have:

>>> re.split(r'watch/\d+/', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'supernatural-dub-hollywood-babylon/en']

But this does not include the split string (the string which appears between the domain and the path). The end answer I want to achieve is:

http://www.hulu.jp/watch/589851

Upvotes: 4

Views: 141

Answers (4)

Veedrac

Reputation: 60147

You've surely seen the Stack Overflow don't-parse-HTML-with-regex post, yes?

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.

Well, regex can parse URLs, but trying to do so when there's a plethora of better tools is foolish.

This is what a regex for URLs looks like:

^(?:(?:https?|ftp):\/\/)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?$ (+ caseless flag)

It's just a mess of characters, right? Exactly!

Don't parse URLs with regex... almost.

There is one simple thing:

A path-relative URL must be zero or more path segments separated from each other by a "/".

So once the URL has been parsed, splitting its path should be as simple as path.split("/").

from urllib.parse import urlparse, urlunparse

myurl = "http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en"

# Run a parser over it
parts = urlparse(myurl)

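# Keep at most the first two path segments. The path starts with "/",
# so split("/") yields an empty first element, hence [:3] rather than [:2].
new_path = "/".join(parts.path.split("/")[:3])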

# Unparse
urlunparse(parts._replace(path=new_path))
#>>> 'http://www.hulu.jp/watch/589851'

Upvotes: 0

Adam Parkin

Reputation: 18680

As mentioned in the other answer, you need to use groups to capture the "glue" between the split strings.

I wonder, though, whether what you want here is a split() or a search(). From the sample, it looks like you're trying to extract everything from the first occurrence of /watch/XXX/ (where XXX is one or more digits) to the end of the string. If that's the case, a match/search might be more suitable: with a split, a regex that can match multiple times will split the string into multiple pieces. For example:

re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/', 'watch/2342/', 'fdsaafsdf']

Which doesn't look like what you want. Instead perhaps:

result = re.search(r'(watch/\d+/)(.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groups() if result else []

which gives:

('watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')

You could also use this approach combined with named groups to get extra fancy:

result = re.search(r'(?P<watchId>watch/\d+/)(?P<path>.*)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf')
result.groupdict() if result else {}

giving:

{'path': 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', 'watchId': 'watch/589851/'}

If you're set on the split() approach, you can also set the maxsplit parameter to ensure it's only split once:

re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf', maxsplit=1)

giving:

['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf']

Personally, though, I find that when parsing URLs into their constituent parts, the search() with named groups approach works extremely well: it lets you name the various parts in the regex itself, and groupdict() gives you a nice dictionary for working with those parts.
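Applied to the URL in the question, that idea might look like the sketch below. The group names base, watch_id and rest are illustrative choices, not anything from the question, and the lazy .*? assumes only the first /watch/N matters.

```python
import re

url = 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en'

# base:     everything up to and including /watch/<digits> (no trailing slash)
# watch_id: just the digits
# rest:     whatever follows, if anything
pattern = re.compile(r'(?P<base>.*?/watch/(?P<watch_id>\d+))(?P<rest>/.*)?')

m = pattern.match(url)
parts = m.groupdict() if m else {}

print(parts.get('base'))      # http://www.hulu.jp/watch/589851
print(parts.get('watch_id'))  # 589851
```

If the URL has no /watch/N part, match() returns None and parts is an empty dict, so .get() returns None rather than raising.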

Upvotes: 4

Kasravnd

Reputation: 107297

You need to use a capture group:

>>> re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/en']
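To get from that list to the URL asked for in the question, one option is to join the first two pieces and drop the trailing slash. This is a minimal sketch, assuming only the first match matters (hence maxsplit=1); the variable names are illustrative.

```python
import re

url = 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en'

# The capture group keeps the delimiter in the result list;
# maxsplit=1 stops splitting after the first match.
parts = re.split(r'(watch/\d+/)', url, maxsplit=1)

# parts[0] is the text before the match and parts[1] is the match itself.
# If the pattern does not match, parts has a single element.
prefix = (parts[0] + parts[1]).rstrip('/') if len(parts) > 1 else None

print(prefix)  # http://www.hulu.jp/watch/589851
```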

Upvotes: 6

apgp88

Reputation: 985

You can try the following regex:

.*\/watch\/\d+

Working Demo
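In Python, this regex could be used with re.match() roughly as follows (the backslashes before / are not needed there, so they are dropped). One caveat worth hedging: the greedy .* runs to the last /watch/N in the string, so if the URL could contain more than one, a lazy .*? is probably what you want. The second URL below is a made-up example for illustration.

```python
import re

url = 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en'

m = re.match(r'.*/watch/\d+', url)
result = m.group(0) if m else None
print(result)  # http://www.hulu.jp/watch/589851

# Hypothetical URL with two /watch/N parts, to show greedy vs. lazy.
multi = 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/watch/2342/fdsaafsdf'

greedy = re.match(r'.*/watch/\d+', multi).group(0)   # ends at the last /watch/2342
lazy = re.match(r'.*?/watch/\d+', multi).group(0)    # ends at the first /watch/589851
```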

Upvotes: -1
