Reputation: 8338
I have this file:
a=1 b=2 1234j12342134h d="a v" id="y_123456" something else
a=1 b=2 1234j123421341 d="a" something else
a=1 b=2 1234j123421342 d="a D v id=" id="y_123458" something else
a=1 b=2 1234j123421344 d="a v" something else
a=1 b=2 1234j123421346 d="a.a." id="y_123410" something else
and I want to retrieve only the lines that contain 'id=', and only the value for id and the 3rd column. The final product should be
1234j12342134h id="y_123456"
1234j123421342 id="y_123458"
1234j123421346 id="y_123410"
or
1234j12342134h "y_123456"
1234j123421342 "y_123458"
1234j123421346 "y_123410"
or even
1234j12342134h y_123456
1234j123421342 y_123458
1234j123421346 y_123410
I tried grep -o matching the beginning and end of the expression, but that misses the first block of ids. I tried awk, but that fails for columns with spaces.
I got it working with Java, but it is slow as the log files get bigger.
How can I do it using bash utilities?
Upvotes: 2
Views: 51
Reputation: 1239
Using the Unix shell only (perhaps I am mistaking "bash utilities" for just builtins), the read command can split every line into variables of your choice, based on the input field separator IFS (whitespace by default). For example, processing only your first line as a test case:
$ echo a=1 b=2 1234j12342134h d="a v" id="y_123456" something else | \
if read ign1 ign2 f3 ign4 ign5 f6 rest
then echo $f3 $f6;
fi
1234j12342134h id=y_123456
$
You could go from here to cat and a while loop, reading all the lines and handling each according to its structure. (Note that in the example above you lose the quote characters, because the shell interprets them on the command line before echo runs.) Handling the pieces can become rather complex, requiring further commands and conditionals.
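As a rough illustration of that loop, here is a minimal sketch in bash. It assumes the first three columns never contain spaces and that a real id value is non-empty and has no spaces or quotes in it. The function name extract_ids is made up for this example, and the [[ =~ ]] test and <<< here-string are bash features, not plain sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints "<3rd column> <id value>" for every
# input line that contains a non-empty, space-free id="..." value.
extract_ids() {
    local line ign1 ign2 f3 rest
    local re='id="([^" ]+)"'   # kept in a variable so bash treats it as a regex
    while IFS= read -r line; do
        # Skip lines with no usable id="..."; a stray id=" inside another
        # quoted value does not match because a space or quote follows it.
        [[ $line =~ $re ]] || continue
        # Assumption: columns 1-3 contain no spaces, so word splitting is safe.
        read -r ign1 ign2 f3 rest <<< "$line"
        printf '%s %s\n' "$f3" "${BASH_REMATCH[1]}"
    done
}

# Demo on three of the lines from the question.
extract_ids <<'EOF'
a=1 b=2 1234j12342134h d="a v" id="y_123456" something else
a=1 b=2 1234j123421341 d="a" something else
a=1 b=2 1234j123421342 d="a D v id=" id="y_123458" something else
EOF
```

Reading from a real log would be extract_ids < file. A pure shell loop like this tends to be much slower than awk on large files, so treat it as a demonstration of the idea rather than a recommendation.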
Therefore, better options would include using awk or Perl, with the string processing logic adapted from your Java solution. In any solution, splitting input at certain places in each line seems a good first step, since a single, all-encompassing regular expression for grep would seem rather tricky.
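That said, a single expression gets more manageable in a tool with capture groups, such as sed. The following is only a sketch under some assumptions: the first three columns contain no spaces, the id value is non-empty with no spaces or quotes, and your sed accepts -E for extended regular expressions (GNU and BSD sed both do). The function name third_col_and_id is invented for the example.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: prints "<3rd column> <id value>" for lines with id="...".
# Reads the files given as arguments, or stdin when there are none.
third_col_and_id() {
    # Group 1 is the 3rd space-separated column, group 2 the id value.
    # The greedy .* means the LAST id="..." on a line is the one captured.
    sed -nE 's/^[^ ]+ +[^ ]+ +([^ ]+) .*id="([^" ]+)".*/\1 \2/p' "$@"
}

# Demo on two of the lines from the question.
third_col_and_id <<'EOF'
a=1 b=2 1234j123421341 d="a" something else
a=1 b=2 1234j123421342 d="a D v id=" id="y_123458" something else
EOF
```

Lines without a usable id="..." are dropped, because -n suppresses default output and the p flag prints only where the substitution succeeded.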
Upvotes: -1
Reputation: 204406
With GNU awk (for the 3rd arg to match()):
$ gawk 'match($0,/id="[^" ]+"/,a){ print $3, a[0] }' file
1234j12342134h id="y_123456"
1234j123421342 id="y_123458"
1234j123421346 id="y_123410"
With other awks:
$ awk 'match($0,/id="[^" ]+"/){ print $3, substr($0,RSTART,RLENGTH) }' file
1234j12342134h id="y_123456"
1234j123421342 id="y_123458"
1234j123421346 id="y_123410"
or, if you want to strip the leading id=" and the trailing quote, a couple of ways would be:
$ gawk 'match($0,/id="([^" ]+)"/,a){ print $3, a[1] }' file
1234j12342134h y_123456
1234j123421342 y_123458
1234j123421346 y_123410
or:
$ awk 'match($0,/id="[^" ]+"/){ print $3, substr($0,RSTART+4,RLENGTH-5) }' file
1234j12342134h y_123456
1234j123421342 y_123458
1234j123421346 y_123410
Upvotes: 5