Reputation: 123
I am wondering, if one can print the matched strings as it is in each line... using grep or sed?
The Sun
Thunder The Rain They say
They say The dance
If I use this command:
egrep -o 'The|They' File1
The output I get is:
The
The
They
They
The
But, my expected output should be as below:
The
The They
They The
I am aware that, In grep the option -o, --only-matching prints only the matched non-empty) parts of a matching line, with each such part on a separate output line.
Edit: Please also suggest, if one wants to have a filter with exact word match with multiple match strings
i.e. <The> and <They> exact word match? Space separated words simply.
The Sun
Thunder The Rain They say
They say The dance
They're dancing with them in the dorm
The sun is shining the east and they scream.
Output is:
The
The They
They the
the
The the they
How to approach this?
Upvotes: 1
Views: 1008
Reputation: 66964
Things are not fully specified so here are a couple of possibilities
To catch all words starting with The
, and print them with a space in between
perl -wnE'say join " ", /\bThe\w*/g' file
where \b
is a word-boundary, a zero-width anchor, and \w
is a word character. Using \S
(a non-space character) is yet more permissive.
For only The
or They
can instead use
perl -wnE'say join " ", /\bThey?\b/g' file
where y?
makes y
optional.
To allow the
as well use [tT]
instead of T
in the pattern, or /i
for either case for all chars.
It's been clarified in coments that punctuation after The|They
isn't allowed, and that low case t
is. Then we need to constrain the match by space, not word boundary, and use [tT]
as mentioned
perl -wnE'say join " ", /\b([Tt]hey?)\s/g' file
Now the capturing parenthesis ()
are needed since \s
does consume, unlike \b
before.
This prints the desired output with the provided input.
Upvotes: 2
Reputation: 496
Best do it with Perl:
~$ perl -nE 'say /They? /g' File1
The
The They
They The
EDIT : Add new conditions. The regex still matches all but the lowercase the
. Adding the i
flag makes the match case-insensitive and matches all your test strings.
$ perl -nE 'say /They? /ig' File1
The
The They
They The
the
The the they
There is a little bit of a trick here: the match also picks up the space after the ?
and prints it in the output. E.g. the first line of output is realy: "The_\n" - where "_" = space character. This may or may not be acceptable. One way to remove the spaces and reassemble the string would be:
$ perl -nE 'say join " ", map {substr $_,0,-1} /They? /ig' File1
As to your question about matching full words <The> and <They>, as you put it, the ?
in They?
indicates that the 'y' is optional. I.e. matches 0 or 1 times. Therefore the pattern is considering 'The' and 'They' as full words, one or the other, followed by a space. You could rewrite the pattern as:
$ perl -nE 'say /(?:They|The) /ig' File1
And effect the same output.
Now that you are considering lowercase the
you may run into more edge case "gotchas" like words that end in "the". "loathe" and "tythe" come to mind.
$ echo "I'm loathe to cringe and tythe socks" >> File1
$ perl -nE 'say /They? /ig' File1
The
The They
They The
the
The the they
the the <--- not wanted!
You can then add the \b
test in to match on word boundaries (as in zdim's answer):
$ perl -nE 'say /\bThey? /ig' File1
The
The They
They The
the
The the they
<-- But you get this empty line where no match occurs
So to refine further, you could only print if the line matches. Like this:
$ perl -nE 'say /\bThey? /ig if /\bThey? /i' File1
The
The They
They The
the
The the they
Then, I'm sure, you can find more edge cases that will blow it all up and force further refinement.
Upvotes: 3
Reputation: 204426
With GNU awk for FPAT
:
$ awk -v FPAT='\\<[Tt]hey?\\>' '{$1=$1}1' file
The
The They
They The
They the
The the they
Note that that can't NOT identify They
when it appears in They're
. If that's really an issue and you want to look for space-separated complete strings then this might be what you want:
$ awk '{c=0; for (i=1;i<=NF;i++) if ($i ~ /^[Tt]hey?$/) printf "%s%s", (c++?OFS:""), $i; print ""}' file
The
The They
They The
the
The the they
If not, let us know.
The above was run against this iteration of the OPs posted sample input:
$ cat file
The Sun
Thunder The Rain They say
They say The dance
They're dancing with them in the dorm
The sun is shining the east and they scream.
Upvotes: 4
Reputation: 133710
try one more awk:
awk '{while(match($0,/The|They/)){string=substr($0,RSTART,RLENGTH);VAL=VAL?VAL OFS string:string;$0=substr($0,RSTART+RLENGTH+1);};print VAL;VAL=""}' Input_file
NON-ONE line form of solution as follows too.
awk '{
while(match($0,/The|They/)){
string=substr($0,RSTART,RLENGTH);
VAL=VAL?VAL OFS string:string;
$0=substr($0,RSTART+RLENGTH+1);
};
print VAL;
VAL=""
}
' Input_file
Will add the explanation shortly for same.
Upvotes: 1
Reputation: 67547
awk
to the rescue!
$ awk -v p="They?" '$0~p{for(i=1;i<=NF;i++) if($i~p) printf "%s",$i OFS; print ""}' file
The
The They
They The
Upvotes: 1