Linguist

Reputation: 123

Can grep or sed show only words that match multiple search patterns in a line?

I am wondering whether grep or sed can print the matched strings grouped by input line, i.e. all matches from one input line together on one output line.

TestCase1: File1 contains below text

The Sun
Thunder The Rain They say
They say The dance

If I use this command:

egrep -o 'The|They' File1

The output I get is:

The
The
They
They
The

But, my expected output should be as below:

The
The They
They The

I am aware that grep's -o (--only-matching) option prints only the matched (non-empty) parts of a matching line, with each such part on a separate output line.

Edit: Please also suggest how to filter on exact whole-word matches with multiple patterns, i.e. <The> and <They> matched only as complete, space-separated words.

TestCase2: File2 contains below text

The Sun
Thunder The Rain They say
They say The dance
They're dancing with them in the dorm
The sun is shining the east and they scream.

Expected output is:

The
The They
They the
the
The the they

How to approach this?

Upvotes: 1

Views: 1008

Answers (5)

zdim

Reputation: 66964

Things are not fully specified, so here are a couple of possibilities:

  • To catch all words starting with The, and print them with a space in between

    perl -wnE'say join " ", /\bThe\w*/g' file
    

    where \b is a word-boundary, a zero-width anchor, and \w is a word character. Using \S (a non-space character) is yet more permissive.

  • To match only The or They, use instead

    perl -wnE'say join " ", /\bThey?\b/g' file
    

    where y? makes y optional.

To allow the as well, use [tT] instead of T in the pattern, or add the /i modifier to ignore case for all characters.


It's been clarified in comments that punctuation after The|They isn't allowed, and that a lowercase t is. So we need to constrain the match by whitespace rather than by a word boundary, and use [tT] as mentioned:

perl -wnE'say join " ", /\b([Tt]hey?)\s/g'  file

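Now the capturing parentheses () are needed because \s consumes the whitespace character, which would otherwise be included in the returned match; the \b used before is zero-width and consumes nothing.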

This prints the desired output with the provided input.

Upvotes: 2

cbmckay

Reputation: 496

Best do it with Perl:

$ perl -nE 'say /They? /g' File1
The
The They
They The

EDIT: With the new conditions added to the question, the regex above still matches everything except the lowercase the. Adding the i flag makes the match case-insensitive, so it matches all your test strings.

$ perl -nE 'say /They? /ig' File1
The
The They
They The
the
The the they

There is a little bit of a trick here: the pattern includes the space after They?, so that space is printed in the output too. E.g. the first line of output is really "The_\n", where "_" stands for a space character. This may or may not be acceptable. One way to remove the spaces and reassemble the string would be:

$ perl -nE 'say join " ", map {substr $_,0,-1} /They? /ig' File1

As to your question about matching the full words <The> and <They>, as you put it: the ? in They? makes the 'y' optional, i.e. it matches 0 or 1 times. The pattern therefore treats 'The' and 'They' as full words, one or the other, followed by a space. You could rewrite the pattern as:

$ perl -nE 'say /(?:They|The) /ig' File1

This produces the same output.

Now that you are considering lowercase the you may run into more edge case "gotchas" like words that end in "the". "loathe" and "tythe" come to mind.

$ echo "I'm loathe to cringe and tythe socks" >> File1
$ perl -nE 'say /They? /ig' File1
The
The They
They The
the
The the they
the the  <--- not wanted!

You can then add the \b test in to match on word boundaries (as in zdim's answer):

$ perl -nE 'say /\bThey? /ig' File1
The
The They
They The
the
The the they
              <-- But you get this empty line where no match occurs

So to refine further, you could only print if the line matches. Like this:

$ perl -nE 'say /\bThey? /ig if /\bThey? /i' File1
The
The They
They The
the
The the they

Then, I'm sure, you can find more edge cases that will blow it all up and force further refinement.

Upvotes: 3

Ed Morton

Reputation: 204426

With GNU awk for FPAT:

$ awk -v FPAT='\\<[Tt]hey?\\>' '{$1=$1}1' file
The
The They
They The
They the
The the they

Note that this cannot avoid matching They when it appears in They're, because the apostrophe ends the word. If that's really an issue and you want to look for space-separated complete strings then this might be what you want:

$ awk '{c=0; for (i=1;i<=NF;i++) if ($i ~ /^[Tt]hey?$/) printf "%s%s", (c++?OFS:""), $i; print ""}' file
The
The They
They The
the
The the they

If not, let us know.

The above was run against this iteration of the OP's posted sample input:

$ cat file
The Sun
Thunder The Rain They say
They say The dance
They're dancing with them in the dorm
The sun is shining the east and they scream.

Upvotes: 4

RavinderSingh13

Reputation: 133710

try one more awk:

awk '{while(match($0,/The|They/)){string=substr($0,RSTART,RLENGTH);VAL=VAL?VAL OFS string:string;$0=substr($0,RSTART+RLENGTH+1);};print VAL;VAL=""}'   Input_file

Here is the same solution in multi-line form:

awk '{
        while(match($0,/The|They/)){
                                        string=substr($0,RSTART,RLENGTH);
                                        VAL=VAL?VAL OFS string:string;
                                        $0=substr($0,RSTART+RLENGTH+1);
                                   };
        print VAL;
        VAL=""
     }
    '   Input_file

I will add an explanation of this shortly.
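
One reading of the loop above, offered as a commented sketch rather than the author's own explanation. It uses only POSIX awk features (match, RSTART, RLENGTH, substr); the lowercase variable names and the input and out shell variables are illustrative, and the sample text is TestCase1 from the question:

```shell
# Sample input: the TestCase1 text from the question, in a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
The Sun
Thunder The Rain They say
They say The dance
EOF

out=$(awk '{
    val = ""                                   # matches collected for this line
    while (match($0, /The|They/)) {            # on success sets RSTART and RLENGTH
        hit = substr($0, RSTART, RLENGTH)      # the text that matched
        val = (val == "" ? hit : val OFS hit)  # append, separated by OFS
        $0 = substr($0, RSTART + RLENGTH + 1)  # keep only what follows the match
    }
    print val                                  # prints an empty line if nothing matched
}' "$input")
printf '%s\n' "$out"

rm -f "$input"
```

On this input it should print The, The They and They The. Note that the pattern is not anchored to word boundaries, so it would also match inside words such as "Then" or "They're".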

Upvotes: 1

karakfa

Reputation: 67547

awk to the rescue!

$ awk -v p="They?" '$0~p{for(i=1;i<=NF;i++) if($i~p) printf "%s",$i OFS; print ""}' file

The
The They
They The

Upvotes: 1
