Magnus

Reputation: 3722

Regex: how to match up to a character or the end of a line?

I am trying to separate out parts of a path as follows. My input path takes the following possible forms:

bucket
bucket/dir1
bucket/dir1/dir2
bucket/dir1/dir2/dir3
...

I want to separate the first part of the path (bucket) from the rest of the string if present (dir1/dir2/dir3/...), and store both in separate variables.

The following gives me something close to what I want:

❯ BUCKET=$(echo "bucket/dir1/dir2" | sed 's@\(^[^\/]*\)[\/]\(.*\)@\1@')
❯ EXTENS=$(echo "bucket/dir1/dir2" | sed 's@\(^[^\/]*\)[\/]\(.*\)@\2@')
❯ echo $BUCKET $EXTENS
bucket dir1/dir2

However, it fails if the input is only bucket (without a slash):

❯ BUCKET=$(echo "bucket" | sed 's@\(^[^\/]*\)[\/]\(.*\)@\1@')
❯ EXTENS=$(echo "bucket" | sed 's@\(^[^\/]*\)[\/]\(.*\)@\2@')
❯ echo $BUCKET $EXTENS
bucket bucket

... because, in the absence of the first '/', the pattern does not match, so no substitution takes place and sed prints the input unchanged. When the input is just 'bucket' I would like $EXTENS to be set to the empty string "".

Thanks!

Upvotes: 0

Views: 78

Answers (3)

Renaud Pacalet

Reputation: 29212

For something so simple you could use bash's built-in parameter expansion instead of launching sed:

$ path="bucket/dir1/dir2"
$ bucket="${path%%/*}"
$ extens="${path#"$bucket"}"
$ printf '|%s|%s|\n' "$bucket" "$extens"
|bucket|/dir1/dir2|
$ path="bucket"
$ bucket="${path%%/*}"
$ extens="${path#"$bucket"}"
$ printf '|%s|%s|\n' "$bucket" "$extens"
|bucket||
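
The expansions above keep the leading slash in the second part, whereas the question's sample output (dir1/dir2) has none. A minimal sketch of one way to drop it, using only POSIX parameter expansion and a case statement; the function names first_component and rest_of_path are hypothetical and chosen just for illustration:

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original answer.

# Print the first path component (everything before the first slash).
first_component() {
    printf '%s\n' "${1%%/*}"
}

# Print everything after the first slash, or an empty line if there is no slash.
rest_of_path() {
    case $1 in
        */*) printf '%s\n' "${1#*/}" ;;  # strip the shortest prefix ending in '/'
        *)   printf '\n' ;;              # no slash: nothing follows the bucket
    esac
}

first_component "bucket/dir1/dir2"   # prints: bucket
rest_of_path "bucket/dir1/dir2"      # prints: dir1/dir2
rest_of_path "bucket"                # prints an empty line
```

The case guard is needed because "${1#*/}" leaves the string unchanged when it contains no slash, which would reproduce the "bucket bucket" result from the question.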

But if you really want to use sed and capture groups:

$ declare -a bucket_extens
$ mapfile -td '' bucket_extens < <(printf '%s' "bucket/dir1/dir2" | sed -E 's!([^/]*)(.*)!\1\x00\2!')
$ printf '|%s|%s|\n' "${bucket_extens[@]}"
|bucket|/dir1/dir2|
$ mapfile -td '' bucket_extens < <(printf '%s' "bucket" | sed -E 's!([^/]*)(.*)!\1\x00\2!')
$ printf '|%s|%s|\n' "${bucket_extens[@]}"
|bucket||

We use the extended regex (-E) to simplify a bit, and ! as separator of the substitute command. The first capture group is simply anything not containing a slash and the second is everything else, including nothing if there's nothing else.

In the replacement string we separate the two capture groups with a NUL character (\x00). We then use mapfile to assign the result to the bash array bucket_extens.

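The NUL trick is a way to deal with file names containing spaces, newlines, and so on: NUL is the only character that cannot appear in a path. The -d '' option of mapfile indicates that the records to map are separated by NUL instead of the default newline. Note that sed itself still reads its input line by line, so for a path that contains a newline, GNU sed's -z option (NUL-separated input) would presumably also be needed.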

Upvotes: 2

Bohemian

Reputation: 425063

Don't capture anything. Instead, just match what you don't want and replace it with nothing:

BUCKET=$(echo "bucket" | sed 's@/.*@@')           # bucket
BUCKET=$(echo "bucket/dir1/dir2" | sed 's@/.*@@') # bucket

EXTENS=$(echo "bucket" | sed 's@[^/]*@@')           # blank
EXTENS=$(echo "bucket/dir1/dir2" | sed 's@[^/]*@@') # /dir1/dir2
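
The EXTENS commands above keep the leading slash (/dir1/dir2), while the question's sample output has none. As a sketch of the same "match what you don't want" idea, the pattern can also consume one optional slash after the first component. The interval \{0,1\} is the POSIX basic-regex way to write "zero or one"; the lowercase variable names are only illustrative:

```shell
#!/bin/sh
# Sketch only: delete the unwanted part instead of capturing the wanted part.
for path in "bucket" "bucket/dir1/dir2"; do
    # Delete from the first slash to the end of the line.
    bucket=$(printf '%s\n' "$path" | sed 's@/.*@@')
    # Delete the first component plus at most one following slash.
    extens=$(printf '%s\n' "$path" | sed 's@^[^/]*/\{0,1\}@@')
    printf '|%s|%s|\n' "$bucket" "$extens"
done
```

With these two inputs the loop should print |bucket|| and then |bucket|dir1/dir2|.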

Upvotes: 1

tshiono

Reputation: 22022

Because your regex requires a slash, a string with no slashes will not match. Let's make the slash optional with /\?. (A backslash before ? is required in sed's BRE; note that \? is a GNU sed extension.) Then please try:

#!/bin/bash

#path="bucket/dir1/dir2"
path="bucket"
bucket=$(echo "$path" | sed 's@\(^[^/]*\)/\?\(.*\)@\1@')
extens=$(echo "$path" | sed 's@\(^[^/]*\)/\?\(.*\)@\2@')
echo "$bucket" "$extens"
  • You don't need to prepend a backslash to a slash.
  • By convention, it is recommended to use lowercase names for user variables.
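
On a sed without the GNU \? extension, the same optional-slash idea can be written with extended regular expressions, where a plain ? means "zero or one". The following is a sketch under that assumption (it needs a sed that accepts -E); the helper name split_field and its arguments are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: $1 is the path, $2 is the capture group to print
# (1 = first component, 2 = everything after the first slash).
split_field() {
    # "\\$2" expands to a backreference such as \1 or \2 inside the sed script.
    printf '%s\n' "$1" | sed -E "s@^([^/]*)/?(.*)@\\$2@"
}

path="bucket/dir1/dir2"
bucket=$(split_field "$path" 1)
extens=$(split_field "$path" 2)
printf '|%s|%s|\n' "$bucket" "$extens"
```

Because the slash is optional and the second group may match nothing, an input of just "bucket" should yield an empty second field rather than a repeated "bucket".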

Upvotes: 0
