Beans On Toast

Reputation: 1081

Capture everything before second slash - regex

I have the following string:

/youtube.com/videos/cats
/google.com/images/dogs

I'm trying to find a regex that captures the text up to and including the second slash, ignoring the rest of the string.

So the result would look like this:

/youtube.com/
/google.com/

For reference I am using Python 3.7

I have tried positive lookbehinds and the closest I got was this: [^/]/

Any help appreciated

Upvotes: 0

Views: 2505

Answers (3)

Steven

Reputation: 6148

Input

/youtube.com/videos/cats
/google.com/images/dogs

RegEx

/.*?/    : [1]
/        : First slash
 .*?     : Non-greedy match anything
    /    : Second slash

    /* Output:
          FULL MATCH
      1>  /youtube.com/
      2>  /google.com/
    */



/(.*?)/  : [2] : Same as [1] BUT captures the text between the slashes as a group {symbol: (...)}
^/(.*?)/ : [3] : Same as [2] BUT specifies match must be at start of string {symbol: ^}

    /* Output:
          FULL MATCH           CAPTURE GROUP
      1>  /youtube.com/        youtube.com
      2>  /google.com/         google.com
    */


(/(.*?)(?=$|/))  : [4] : Captures text between all slashes
** FLAGS: MULTILINE
    ** Unless passed in individually (i.e. one expression per URL)

    /* Output:
          FULL MATCH          CAPTURE GROUP
      1>  /youtube.com        youtube.com
      2>  /videos             videos
      3>  /cats               cats
      4>  /google.com         google.com
      5>  /images             images
      6>  /dogs               dogs
    */



/(.*?)(?=$|/)    : [5] : Same as [4] BUT doesn't capture leading slashes
** FLAGS: MULTILINE
    ** Unless passed in individually (i.e. one expression per URL)

    /* Output:
          CAPTURE GROUP
      1>  youtube.com
      2>  videos
      3>  cats
      4>  google.com
      5>  images
      6>  dogs
    */

Example 1

Match text between first two slashes. Singular input.

import re
regex    = r"/(.*?)/"
test_str = "/youtube.com/videos/cats"
matches  = re.findall(regex, test_str)

# RESULT: matches == ['youtube.com']

Example 2

Match text between first two slashes. Multiline input.

import re
regex    = r"/(.*?)/"
test_str = """
/youtube.com/videos/cats
/google.com/images/dogs
"""
matches  = re.findall(regex, test_str)

# RESULT: matches == ['youtube.com', 'google.com']

Example 3

Match text between all slashes. Singular input.

import re
regex    = r"/(.*?)(?=$|/)"
test_str = "/youtube.com/videos/cats"
matches  = re.findall(regex, test_str)

# RESULT: matches == ['youtube.com', 'videos', 'cats']

Example 4

Match text between all slashes. Multiline input.

import re
regex = r"/(.*?)(?=$|/)"
test_str = """
/youtube.com/videos/cats
/google.com/videos/cats
"""
matches = re.findall(regex, test_str, re.MULTILINE)

# RESULT: matches == ['youtube.com', 'videos', 'cats', 'google.com', 'videos', 'cats']
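
The examples above return only the captured group, without the slashes. Since the question asks for the slashes to be kept, here is a minimal sketch that returns the full match instead. It assumes each URL sits on its own line and uses pattern [1] anchored with ^ plus the MULTILINE flag; the variable names are only illustrative.

```python
import re

# Pattern [1] with a start-of-line anchor: slash, lazy "anything", slash.
regex = r"^/.*?/"
test_str = """
/youtube.com/videos/cats
/google.com/images/dogs
"""

# group(0) is the full match, so both slashes are kept.
matches = [m.group(0) for m in re.finditer(regex, test_str, re.MULTILINE)]

print(matches)  # ['/youtube.com/', '/google.com/']
```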

Upvotes: 0

user5386938

Reputation:

If you want to use re.sub to remove the text after the 2nd slash then perhaps the following will help.

import re

data = '''\
/youtube.com/videos/cats
/google.com/images/dogs
'''
pattern = re.compile(r'^(/[^/]+/).+?$', re.MULTILINE)

print(pattern.sub(r'\1', data))
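
As a quick check of what that substitution yields, the sketch below runs the same pattern on the sample data and splits the result into lines. Note that .+? needs at least one character after the second slash, so a line that is only '/youtube.com/' would be left unchanged, which happens to be the desired result anyway.

```python
import re

data = "/youtube.com/videos/cats\n/google.com/images/dogs\n"
pattern = re.compile(r'^(/[^/]+/).+?$', re.MULTILINE)

# Replace each line with its first captured group, then split into lines.
trimmed = pattern.sub(r'\1', data).splitlines()

print(trimmed)  # ['/youtube.com/', '/google.com/']
```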

Upvotes: 0

Green Cloak Guy

Reputation: 24691

The regex I provided in a comment will work. By matching the start of the string with re.match(), you can extract the area that was matched as a group.

>>> your_string = '/google.com/images/dogs'
>>> import re
>>> re.match(r'^/[^/]*/', your_string).group(0)
'/google.com/'

Here's how the regex is laid out:

  • ^ start of string
  • / a slash character
  • [^/]* any number of characters that are not slashes
  • / another slash character

So this regex will match the first slash, the second slash, and the text in between them, as long as they come at the beginning of the string.
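
One thing to watch for: re.match() returns None when the string does not start with that pattern, so calling .group(0) directly would raise an AttributeError. A minimal sketch of a guard, using a hypothetical helper name first_segment:

```python
import re

def first_segment(path):
    """Return the leading '/.../' part of path, or None if it is absent."""
    m = re.match(r'^/[^/]*/', path)
    return m.group(0) if m else None

print(first_segment('/google.com/images/dogs'))  # /google.com/
print(first_segment('google.com'))               # None
```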


If you want the rest of the string instead, ignoring this first part, you can add a capture group after it and pull group 1 (the first captured group) instead of group 0 (the entire match):

>>> re.match(r'^/[^/]*/(.*)$', your_string).group(1)
'images/dogs'
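
If you need both parts at once, one option is to wrap the first part in its own group as well. This is only a sketch under that assumption; the names head and rest are illustrative.

```python
import re

your_string = '/google.com/images/dogs'

# Group 1: everything up to and including the second slash.
# Group 2: the remainder of the string.
m = re.match(r'^(/[^/]*/)(.*)$', your_string)
head, rest = m.groups()

print(head)  # /google.com/
print(rest)  # images/dogs
```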

Upvotes: 2
