Reputation: 17930
I am basically grepping with a regular expression. In the output, I would like to see only the strings that match my regex.
In a bunch of XML files (mostly they are single-line files with huge amounts of data in a line), I would like to get all the words that start with MAIL_.
Also, I would like the grep command on the shell to give only the words that matched and not the entire line (which is the entire file in this case).
How do I do this?
I have tried
grep -Gril MAIL_* .
grep -Grio MAIL_* .
grep -Gro MAIL_* .
Upvotes: 14
Views: 33962
Reputation: 644
From your comment on Thor's answer, it seems you also want to distinguish whether the MAIL_.* text is a text node or an attribute, not just isolate it wherever it appears in the XML document. Grep cannot parse XML; you need a proper XML parser for that.
A command-line XML parser is xmlstarlet. It is packaged in Ubuntu.
Using it on this example file:
$ cat test.xml
<some_root>
<test a="MAIL_as_attribute">will be printed if you want matching attributes</test>
<bar>MAIL_as_text will be printed if you want matching text nodes</bar>
<MAIL_will_not_be_printed>abc</MAIL_will_not_be_printed>
</some_root>
For selecting text nodes you can use:
$ xmlstarlet sel -t -m '//*' -v 'text()' -n test.xml | grep -Eo 'MAIL_[^[:space:]]*'
MAIL_as_text
And for selecting attributes:
$ xmlstarlet sel -t -m '//*[@*]' -v '@*' -n test.xml | grep -Eo 'MAIL_[^[:space:]]*'
MAIL_as_attribute
Brief explanations:
//* is an XPath expression that selects all elements in the document, and text() outputs the value of their child text nodes, so everything except text nodes gets filtered out.
//*[@*] is an XPath expression that selects all elements that have at least one attribute, and @* then outputs the attribute values.
Upvotes: 0
Reputation: 2284
First of all, with the GNU grep that ships with Ubuntu, the -G flag (basic regexp) is the default, so you can omit it. Better still, use extended regexps with -E.
The -r flag means a recursive search through the files of a directory, which is what you need.
You are also right to use the -o flag to print only the matching part of a line. To omit file names from the output, you will need the -h flag as well.
The only mistake you made is the regular expression itself: there is no character specification before the *. In MAIL_*, the * applies to the underscore, so the pattern matches MAIL followed by zero or more underscores. Quote the pattern as well, so the shell does not try to expand it as a glob. Your command should look like this:
grep -Ehro 'MAIL_[^[:space:]]*' .
Sample output (not recursive):
$ echo "Some garbage MAIL_OPTION comes MAIL_VALUE here" | grep -Eho 'MAIL_[^[:space:]]*'
MAIL_OPTION
MAIL_VALUE
Upvotes: 18
Reputation: 101
grep -o (or --only-matching) outputs only the matching text instead of complete lines, but the problem could be that your regex is not restrictive enough, or is too greedy, and actually matches the whole file.
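To illustrate the greediness point, here is a small sketch on a made-up line (the sample text and variable names are assumptions, not taken from the question). With .* the match runs to the end of the line, which for a single-line file is the rest of the file; a class that stops at whitespace yields the individual words.

```shell
# Hypothetical single-line input.
line='x MAIL_A y MAIL_B z'

# Greedy: .* consumes everything up to the end of the line.
greedy=$(printf '%s\n' "$line" | grep -o 'MAIL_.*')

# Restrictive: stop at the first whitespace character.
tight=$(printf '%s\n' "$line" | grep -o 'MAIL_[^[:space:]]*')

printf '%s\n' "$greedy"
# MAIL_A y MAIL_B z

printf '%s\n' "$tight"
# MAIL_A
# MAIL_B
```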
Upvotes: 2