Reputation: 11946

In GNU Grep or another standard bash command, is it possible to get a resultset from regex?

Consider the following:

var="text more text and yet more text"
echo $var | egrep "yet more (text)"

It should be possible to get the result of the regex as the string: text

However, I don't see any way to do this in bash with grep or its siblings at the moment.

In perl, php or similar regex engines:

preg_match('/yet more (text)/', 'text more text and yet more text', $output);
$output[1] == "text";

Edit: To elaborate on why I can't just use several separate regexes: in the end I will have a regex with several of these groups (shown below), so I need to be able to get all of them. This also rules out lookahead/lookbehind, as they are all variable length.

egrep -i "([0-9]+) +$USER +([0-9]+).+?(/tmp/Flash[0-9a-z]+) "

Example input as requested, straight from lsof (Replace $USER with "j" for this input data):

npviewer. 17875          j   11u      REG                8,8 59737848     524264 /tmp/FlashXXu8pvMg (deleted)
npviewer. 17875          j   17u      REG                8,8 16037387     524273 /tmp/FlashXXIBH29F (deleted)

The end goal is to cp /proc/$var1/fd/$var2 ~/$var3 for every line, which ends up "Downloading" flash files (Flash used to store in /tmp but they drm'd it up)

So far I've got:

#!/bin/bash
regex="([0-9]+) +j +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+)"

echo "npviewer. 17875          j   11u      REG                8,8 59737848     524264 /tmp/FlashXXYOvS8S (deleted)" |
sed -r -n -e " s%^.*?$regex.*?\$%\1 \2 \3%p " |
while read -a array
do
   echo /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done

It cuts off the first digits of the first value to return, and I'm not familiar enough with sed to see what's wrong.

End result for downloading flash 10.2+ videos (Including, perhaps, encrypted ones):

#!/bin/bash
lsof | grep "/tmp/Flash" | sed -r -n -e " s%^.+? ([0-9]+) +$USER +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+).*?\$%\1 \2 \3%p " |
while read -a array
do
   cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done

Upvotes: 1

Answers (5)

Paŭlo Ebermann

Reputation: 74810

Edit: look at my other answer for a simpler bash-only solution.

So, here is the solution using sed to fetch the right groups and split them up. You still have to use bash to read them afterwards. (This only works if the groups themselves do not contain any spaces; otherwise we would have to use another separator character and make read split on it by setting $IFS to that value.)

#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "


sed -r -n -e " s%^.*$regex.*\$%\1 \2 \3%p " |
while read -a array
do
   cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done

Note that I had to adapt your last regex group to allow uppercase letters, and added a space at the beginning to be sure to capture the whole block of digits. Alternatively, a \b (word boundary) would have worked here, too.
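To see why that leading space matters, here is a small sketch run against one of the sample lsof lines from the question. The variable names (line, unanchored, spaced, bounded) are my own, the regex is shortened to the first two groups, and \b is a GNU sed extension, so treat this as an illustration rather than a portable recipe.

```bash
#!/bin/bash
# Sample line taken from the question's lsof output.
line='npviewer. 17875          j   11u      REG                8,8 59737848     524264 /tmp/FlashXXu8pvMg (deleted)'

# Nothing before the first group: the greedy ^.* swallows "1787"
# and leaves only the last digit for group 1.
unanchored=$(printf '%s\n' "$line" | sed -r -n -e 's%^.*([0-9]+) +j +([0-9]+).*$%\1 \2%p')

# Leading space: group 1 has to start right after a space.
spaced=$(printf '%s\n' "$line" | sed -r -n -e 's%^.* ([0-9]+) +j +([0-9]+).*$%\1 \2%p')

# \b (GNU extension): group 1 has to start at a word boundary.
bounded=$(printf '%s\n' "$line" | sed -r -n -e 's%^.*\b([0-9]+) +j +([0-9]+).*$%\1 \2%p')

echo "unanchored: $unanchored"
echo "spaced:     $spaced"
echo "bounded:    $bounded"
```

With GNU sed, the first variant prints 5 11, while the other two print 17875 11.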

I forgot to mention that you should pipe the text to this script, like this:

 ./grep-result.sh  < grep-result-test.txt

(provided your files are named like this). Alternatively, you can add < grep-result-test.txt after the sed call (before the |), or prepend cat grep-result-test.txt | to the line.

How does it work?

sed -r -n calls sed in extended-regexp-mode, and without printing anything automatically.
-e " s%^.*$regex.*\$%\1 \2 \3%p " gives the sed program, which consists of a single s command.
- I'm using % instead of the normal / as parameter separator, since / appears inside the regex and I don't want to escape it.
- The regex to search is prefixed by ^.* and suffixed by .*$ to grab the whole line (and avoid printing parts of the rest of the line).
  
  Note that this .* is greedy, so we have to insert a space into our regexp to keep it from grabbing the start of the first digit group too.
- The replacement text consists of the three parenthesized groups, separated by spaces.
- The p flag at the end of the command says to print the pattern space after the replacement. Since we matched the whole line, the pattern space consists of only the replacement text.
So, the output of sed for your example input is this:
```
17875 11 /tmp/FlashXXu8pvMg
17875 17 /tmp/FlashXXIBH29F
```
This is much more friendly for reuse, obviously.
Now we pipe this output as input to the while loop.
- read -a array reads a line from standard input (which is the output from sed, due to our pipe), splits it into words (at spaces, tabs and newlines), and puts the words into an array variable.
  
  We could also have written read var1 var2 var3 instead (preferably using better variable names), then the first two words would be put to $var1 and $var2, with $var3 getting the rest.
- If read succeeded reading a line (i.e. not end-of-file), the body of the loop is executed:
  - ${array[0]} is expanded to the first element of the array, and similarly for the other indexes.
- When the input ends, the loop ends, too.
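The separator caveat and the read var1 var2 var3 variant mentioned above can be combined. The following is only a sketch under assumptions: the input line is a made-up lsof-style line whose file name contains a space, the regex is loosened to allow that space, and the names tab, pid, fd, path and result are mine. It separates the groups with a tab and makes read split on tabs only.

```bash
#!/bin/bash
# Hypothetical lsof-style line whose file name contains a space.
line='npviewer. 17875          j   11u      REG  8,8 59737848  524264 /tmp/Flash video (deleted)'

tab=$'\t'
# Assumption: the last group may contain spaces and is followed by " (deleted)".
regex=' ([0-9]+) +j +([0-9]+).+(/tmp/Flash[0-9a-zA-Z ]+) \(deleted\)'

# IFS is set to a tab for read only, so spaces inside a field survive.
# Process substitution keeps the loop in the current shell.
while IFS=$tab read -r pid fd path
do
    result="pid=$pid fd=$fd path=$path"
    echo "$result"
done < <(printf '%s\n' "$line" | sed -r -n -e "s%^.*$regex.*\$%\1$tab\2$tab\3%p")
```

Named variables make the loop body easier to read than array indexes, and the last variable receives everything that remains on the line.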

Upvotes: 4

Paŭlo Ebermann

Reputation: 74810

After creating my sed solution, I also wanted to try the pure-bash approach suggested by Mark. It works fine for me.

#!/bin/bash

USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "

while read 
do
    if [[ $REPLY =~ $regex ]]
    then
        echo cp /proc/${BASH_REMATCH[1]}/fd/${BASH_REMATCH[2]} ~/${BASH_REMATCH[3]}
    fi
done

(If you upvote this, you should think about also upvoting Mark's answer, since it is essentially his idea.)

The same as before: pipe the text to be filtered to this script.

How does it work?

As Mark said, the [[ ... ]] special conditional construct supports the binary operator =~, which interprets its right operand (after parameter expansion) as an extended regular expression (just as we want) and matches the left operand against it. (We have again added a space at the front to avoid matching only the last digit.)
When the regex matches, the [[ ... ]] returns 0 (= true), and also puts the parts matched by the individual groups (and the whole expression) into the array variable BASH_REMATCH.
Thus, when the regex matches, we enter the then block, and execute the commands there.
Here again ${BASH_REMATCH[1]} is an array-access to an element of the array, which corresponds to the first matched group. ([0] would be the whole string.)

Another note: Both my scripts accept multi-line input and work on every line which matches. Non-matching lines are simply ignored. If you are inputting only one line, you don't need the loop, a simple if read ; then ... or even read && [[ $REPLY =~ $regex ]] && ... would be enough.
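For the single-line case, the loop-free form might look like the sketch below. It feeds one of the question's sample lines through a here-string instead of a pipe; the variable names line and captured are my own.

```bash
#!/bin/bash
regex=' ([0-9]+) +j +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) '
line='npviewer. 17875          j   11u      REG                8,8 59737848     524264 /tmp/FlashXXu8pvMg (deleted)'

# read without a variable name stores the whole line in $REPLY.
read -r <<< "$line" && [[ $REPLY =~ $regex ]] &&
    captured="${BASH_REMATCH[1]} ${BASH_REMATCH[2]} ${BASH_REMATCH[3]}"

echo "$captured"
```

This prints 17875 11 /tmp/FlashXXu8pvMg for the sample line.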

Upvotes: 1

Mark

Reputation: 461

This isn't possible using grep or another tool called from a shell prompt/script, because a child process can't modify the environment of its parent process. If you're using bash 3.0 or later, you can use in-process regular expressions. The syntax is perl-ish (=~) and the match groups are available via ${BASH_REMATCH[x]}, where x is the number of the match group.
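Applied to the simple example from the question, that could look like this minimal sketch (the variable names regex, whole and group are mine):

```bash
#!/bin/bash
var="text more text and yet more text"
# Keeping the pattern in a variable avoids quoting problems on the right of =~.
regex='yet more (text)'

if [[ $var =~ $regex ]]
then
    whole=${BASH_REMATCH[0]}   # the entire match
    group=${BASH_REMATCH[1]}   # the first parenthesized group
    echo "$group"
fi
```

Here ${BASH_REMATCH[0]} holds yet more text and ${BASH_REMATCH[1]} holds text.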

Upvotes: 4

nmichaels

Reputation: 51009

Well, for your simple example, you can do this:

var="text more text and yet more text"
echo $var | grep -e "yet more text" | grep -o "text"

Upvotes: 0

user611775

Reputation: 1353

echo "$var" | pcregrep -o "(?<=yet more )text"

Upvotes: 0
