mhaken

Reputation: 1125

Bash Regex Capture Groups

I have a single string in this kind of format:

"Mike H<[email protected]>" [email protected] "Mike H<[email protected]>"

If I were writing a normal regex in JS, C#, etc., I'd do this:

(?:"(.+?)"|'(.+?)'|(\S+))

And iterate the match groups to grab each string, ideally without the quotes. I ultimately want to add each value to an array, so in the example, I'd end up with 3 items in an array as follows:

Mike H<[email protected]>
[email protected] 
Mike H<[email protected]>

I can't figure out how to replicate this functionality with grep, sed, or bash regexes. I've tried things like:

echo "$email" | grep -oP "\"\K(.+?)(?=\")|'\K(.+?)(?=')|(\S+)"

The problem with this is that while it kind of mimics the functionality of capture groups, it doesn't really work with multiples, so I get captures like

"Mike
H<[email protected]>"
 [email protected] 

If I remove the look ahead/behind logic, I at least get the 3 strings, but the first and last are still wrapped in quotes. In that approach, I pipe the output to read so I can individually add each string to the array, but I'm open to other options.

EDIT:

I think my input example may have been confusing; it's just one possible input. The real input could be double-quoted, single-quoted, or unquoted (without spaces) strings, in any order and in any quantity. The JavaScript/C# regex I provided is the real behavior I'm trying to achieve.

Upvotes: 6

Views: 18006

Answers (8)

JJoao

Reputation: 5347

Your first expression is fine; just be careful with the quoting (use single quotes around the pattern when backslashes are present). At the end, trim the " characters with sed.

$ echo "$email" | grep -Po '".*?"|\S+' | sed -r 's/"$|^"//g'
Mike H<[email protected]>
[email protected]
Mike H<[email protected]>

Upvotes: 1

mhaken

Reputation: 1125

Here is what ended up working, although it isn't as concise as I wanted:

arr=()
while IFS= read -r line; do
  line="${line//\"/}"
  arr+=("${line//\'/}")
done < <(echo "$email" | grep -oP "\"(.+?)\"|'(.+?)'|(\S+)")

This gave me an array of the capturing group and handled the input in any order, wrapped in double or single quotes or none at all if it didn't have a space. It also provided the elements in the array without the wrapping quotes. Appreciate all of the suggestions.
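One thing to watch for: the loop above deletes every quote character, so an apostrophe inside a double-quoted item would be lost too. A possible variant, as a sketch only, strips just the wrapping pair. It assumes negated character classes are an acceptable stand-in for the lazy `.+?`, which lets plain `grep -oE` do the matching. The sample addresses and the name "Pat O'Neil" are made up for illustration.

```bash
#!/usr/bin/env bash
# Made-up sample input; the second item contains an apostrophe that should survive.
email="\"Mike H<mike@example.com>\" \"Pat O'Neil\" bob@example.com"

arr=()
while IFS= read -r item; do
    case $item in
        \"*\") item=${item#\"}; item=${item%\"} ;;   # wrapped in double quotes
        \'*\') item=${item#\'}; item=${item%\'} ;;   # wrapped in single quotes
    esac
    arr+=("$item")
done < <(grep -oE "\"[^\"]+\"|'[^']+'|[^[:space:]]+" <<<"$email")

printf '%s\n' "${arr[@]}"
```

Because `grep -E` picks the longest match at each position, a quoted item glued to other text with no space between them (such as `"a"b`) would come back as a single unquoted token, which differs from the JS/C# regex.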

Upvotes: 0

James Brown

Reputation: 37404

Using GNU awk and FPAT to define fields by content:

$ awk '
BEGIN { FPAT="([^ ]*)|(\"[^\"]*\")" }  # define a field to be space-separated or in quotes
{
    for(i=1;i<=NF;i++) {               # iterate every field
        gsub(/^\"|\"$/,"",$i)          # remove leading and trailing quotes
        print $i                       # output
    }
}' file
Mike H<[email protected]>
[email protected]
Mike H<[email protected]>

Upvotes: 0

RomanPerekhrest

Reputation: 92854

gawk + bash solution (adding each item to array):

email_str='"Mike H<[email protected]>" [email protected] "Mike H<[email protected]>"'

readarray -t email_arr < <(awk -v FPAT="[^\"'[:space:]]+[^\"']+[^\"'[:space:]]+" \
                         '{ for(i=1;i<=NF;i++) print $i }' <<<"$email_str")

Now all items are in email_arr.

Accessing the 2nd item:

echo "${email_arr[1]}"
[email protected]

Accessing the 3rd item:

echo "${email_arr[2]}"
Mike H<[email protected]>

Upvotes: 1

Rahul Verma

Reputation: 3089

Modify your regex like this:

grep -oP '("?\s*)\K.*?(?=")' file

Output:

Mike H<[email protected]>
[email protected]
Mike H<[email protected]>

Upvotes: 0

P....

Reputation: 18371

Using gawk, where RS can be a regular expression (here the records are split on a double quote, optionally followed by a space):

awk -v RS='"|" ' 'NF' inputfile
Mike H<[email protected]>
[email protected]
Mike H<[email protected]>

Upvotes: 0

dawg

Reputation: 103844

You can use Perl:

$ email='"Mike H<[email protected]>" [email protected] "Mike H<[email protected]>"'
$ echo "$email" | perl -lane 'while (/"([^"]+)"|(\S+)/g) {print $1 ? $1 : $2}' 
Mike H<[email protected]>
[email protected]
Mike H<[email protected]>

Or in pure Bash, it gets kinda wordy:

re='\"([^\"]+)\"[[:space:]]*|([^[:space:]]+)[[:space:]]*'
while [[ $email =~ $re ]]; do
    echo "${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
    i=${#BASH_REMATCH}
    email=${email:i}
done 
# same output
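The loop above recognizes only double quotes and prints each item instead of collecting it. A possible variant, offered as a sketch rather than a definitive implementation, also accepts single-quoted items and fills an array. The function name `split_quoted` and the array name `tokens` are hypothetical, and the sample addresses are placeholders. It tries three anchored patterns in order, which approximates the left-to-right alternation of the JS/C# regex.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split_quoted STRING
# Fills the global array `tokens` with double-quoted, single-quoted,
# or bare (whitespace-free) items, without the wrapping quotes.
split_quoted() {
    local rest=$1
    # Patterns live in variables so bash treats them as regexes, not literals.
    local dq='^[[:space:]]*"([^"]+)"'
    local sq="^[[:space:]]*'([^']+)'"
    local bare='^[[:space:]]*([^[:space:]]+)'
    tokens=()
    # || short-circuits, so BASH_REMATCH reflects the first pattern that matched.
    while [[ $rest =~ $dq || $rest =~ $sq || $rest =~ $bare ]]; do
        tokens+=("${BASH_REMATCH[1]}")
        # Drop the consumed prefix (leading whitespace plus the item).
        rest=${rest:${#BASH_REMATCH[0]}}
    done
}

split_quoted "\"Mike H<mike@example.com>\" bob@example.com 'Ann B<ann@example.com>'"
printf '%s\n' "${tokens[@]}"
```

Each pattern is anchored with `^`, so the quoted forms only apply when a quote is the next non-space character; an unbalanced quote falls through to the bare pattern and is kept as part of the token.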

Upvotes: 6

CWLiu

Reputation: 4043

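You can use sed for this, as long as the input always has exactly this shape (a quoted item, an unquoted item, then another quoted item):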

$ sed -r 's/"(.*)" (.*)"(.*)"/\1\n\2\n\3/g' <<< "$EMAIL"
Mike H<[email protected]>
[email protected] 
Mike H<[email protected]>

Upvotes: 1
