Veerendra Mattaparthi

Reputation: 27

Grep for a pattern

I have an HTML file with the following content:

<html>
  <body>
    Test #1 '<%aaa(x,y)%>'
    Test #2 '<%bbb(p)%>'
    Test #3 '<%pqr(z)%>'
  </body>
</html>

Please help me with the regex for a command (grep or awk) which displays the output as follows:

'<%aaa(x,y)%>'
'<%bbb(p)%>'
'<%pqr(z)%>'

Upvotes: 0

Views: 390

Answers (3)

小武哥

Reputation: 446

grep -P "^\s*Test" 1.htm | awk '{print $3}'

Upvotes: 0

Jonathan Leffler

Reputation: 753525

I think that sed is a better choice than awk, but it is not completely clear cut.

sed -n '/ *Test #[0-9]* */s///p' <<!
<html>
  <body>
    Test #1 '<%aaa(x,y)%>'
    Test #2 '<%bbb(p)%>'
    Test #3 '<%pqr(z)%>'
  </body>
</html>
!

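Plain grep is not enough on its own; it returns lines that match a pattern, but doesn't normally edit those lines. (Some implementations, including GNU and BSD grep, have a non-POSIX -o option that prints only the matched part of each line.)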

You could use awk:

awk '/Test #[0-9]+/ { print $3 }'

The pattern matches the test lines and prints the third field. It works because there are no spaces inside the third field (the part after the test number). If there could be spaces there, then the sed script is easier; it already handles them, whereas the awk script would have to be modified to handle them properly.
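To make the difference concrete, here is a small sketch that runs both commands on a hypothetical line with a space inside the tag (the space in 'x, y' is my own variation, not part of the question's data):

```shell
# Hypothetical input: same shape as the question's lines, but with a space
# inside the third field.
line="    Test #1 '<%aaa(x, y)%>'"

# awk splits on whitespace, so $3 stops at the first space inside the tag.
awk_out=$(printf '%s\n' "$line" | awk '/Test #[0-9]+/ { print $3 }')

# sed only deletes the leading "Test #N " prefix and keeps the rest of the line.
sed_out=$(printf '%s\n' "$line" | sed -n '/ *Test #[0-9]* */s///p')

printf 'awk: %s\n' "$awk_out"   # awk: '<%aaa(x,
printf 'sed: %s\n' "$sed_out"   # sed: '<%aaa(x, y)%>'
```

So on this kind of input the awk version silently truncates, while the sed version still prints the whole tag.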


Judging from the comments, the desired output is the material between '<%' and '%>'. So, we use sed, as before:

sed -n '/.*\(<%.*%>\).*/s//\1/p'

On lines which match 'anything-<%-anything-%>-anything', replace the whole line with the part between '<%' and '%>' (including the markers) and print the result. Note that if there are multiple patterns on the line which match, only the last will be printed. (The question and comments do not cover what to do in that case, so this is acceptable. The alternatives are tough and best handled in Perl or perhaps Python.)
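A quick sketch of the multiple-match case, using a made-up line with two tags. The grep -o alternative is an assumption on my part: -o is available in GNU and BSD grep but is not POSIX, and the pattern '<%[^%]*%>' assumes there is no '%' inside a tag body.

```shell
# Hypothetical line with two tags (not from the question's data).
line="a '<%one(x)%>' b '<%two(y)%>'"

# The leading .* is greedy, so the capture ends up holding the last tag only.
sed_out=$(printf '%s\n' "$line" | sed -n '/.*\(<%.*%>\).*/s//\1/p')

# grep -o prints every non-overlapping match, one per output line.
grep_out=$(printf '%s\n' "$line" | grep -o '<%[^%]*%>')

printf 'sed:\n%s\n' "$sed_out"
printf 'grep -o:\n%s\n' "$grep_out"
```

Here sed prints only <%two(y)%>, while grep -o prints <%one(x)%> and <%two(y)%> on separate lines.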

If the single quotes on the lines must be preserved, then you can use either of these - I'd use the first with the double quotes surrounding the regex, but they both work and are equivalent. OTOH, if there were expressions involving $ signs or back-ticks in the regex, the single-quotes are better; there are no metacharacters within a single-quoted string at the shell level.

sed -n "/.*\('<%.*%>'\).*/s//\1/p"
sed -n '/.*\('\''<%.*%>'\''\).*/s//\1/p'

The sequence '\'' is how you embed a single quote into a single-quoted string in a shell script. The first quote terminates the current string; the backslash-quote generates a single quote, and the last quote starts a new single-quoted string.
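As a small check of the quoting, the sketch below builds a string with an embedded single quote and then runs both sed variants on one line of the sample input; the variable names are only for illustration.

```shell
# 'it' + \' + 's' are three adjacent pieces that the shell joins into: it's
quoted='it'\''s'
printf '%s\n' "$quoted"

sample="    Test #1 '<%aaa(x,y)%>'"

# Double-quoted regex: the single quotes need no special treatment.
dq=$(printf '%s\n' "$sample" | sed -n "/.*\('<%.*%>'\).*/s//\1/p")

# Single-quoted regex: each literal quote is written as '\''.
sq=$(printf '%s\n' "$sample" | sed -n '/.*\('\''<%.*%>'\''\).*/s//\1/p')

printf '%s\n%s\n' "$dq" "$sq"
```

Both commands hand sed exactly the same script, so both print '<%aaa(x,y)%>' with the quotes intact.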

Upvotes: 1

glenn jackman

Reputation: 246754

The -o option for grep is what you want:

grep -o "'.*'" filename
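One thing to keep in mind is that '.*' is greedy, so on a line with more than two single quotes it would match from the first quote to the last. A slightly tighter variation (my own sketch, not the command above) anchors on the '<%' and '%>' markers and forbids quotes inside the match. It assumes a grep with -o (GNU or BSD) and reads the question's sample from a here-document instead of a file:

```shell
# The question's sample input, held in a variable for the demo.
html=$(cat <<'EOF'
<html>
  <body>
    Test #1 '<%aaa(x,y)%>'
    Test #2 '<%bbb(p)%>'
    Test #3 '<%pqr(z)%>'
  </body>
</html>
EOF
)

# Print only the quoted <% ... %> tags; [^']* keeps a match from running
# past the closing quote of its own tag.
matches=$(printf '%s\n' "$html" | grep -o "'<%[^']*%>'")
printf '%s\n' "$matches"
```

On the sample this prints the three quoted tags, one per line, matching the output requested in the question.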

Upvotes: 0
