Reputation: 427
I am trying to match several patterns on each input line and print only the first match of each pattern, all on the same output line. One of the patterns, the decimal amount (the fourth field in the first example), can take either of two forms: A,BCD.EF or AB.CD. For example:
Example 1:
12:23 23:23 ASDFGH 1,232.00 22.00
21:22 12:12 ASDSDS 22.00 21.00
The expected output would be
Expected Result 1:
12:23 ASDFGH 1,232.00
21:22 ASDSDS 22.00
I have got this far using my limited knowledge of grep and Stack Overflow:
< test_data.txt grep -one "[0-9]/[0-9][0-9]\|[0-9]*,[0-9]*.[0-9][0-9]\|[0-9]*.[0-9][0-9]" | awk -F ":" '$1 == y { sub(/[^:]:/,""); r = (r ? r OFS : "") $0; next } x { print x, r; r="" } { x=$0; y=$1; sub(/[^:]:/,"",x) } END { print x, r }'
Any ideas on how to make this simpler or cleaner, and how to achieve the complete functionality?
Update 1: A few other examples:
Example 2:
12:21 11111 11:11 ASADSS 11.00 11.00
22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00
Update 2: It seems my question was not clear. One way to look at it: on each line, find the first "time", the first alphanumeric string, and the first decimal value (with or without a comma in it), and print all of them on the same output line. More generically: given an input line, print the first occurrence of pattern 1, the first occurrence of pattern 2, and the first occurrence of pattern 3 (which is itself an "or" of two patterns) on one output line, and do so stably (i.e., preserving the order in which they appeared in the input). Sorry that the example is a little complicated; I am also trying to learn whether this is the point at which to leave Unix utilities for a full language like Perl or Python. Here are the expected results for the second set of examples.
Expected Result 2:
12:21 ASADSS 11.00
22:22 BASDASD 1,231.00
Upvotes: 0
Views: 1024
Reputation: 75488
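#!/usr/bin/awk -f
# Print the first field matching each pattern, preserving input order.
# Note: {2} requires interval-expression support; some older awks lack it.
BEGIN {
    p[0] = "^[0-9]+:[0-9]{2}$"             # time, e.g. 12:23
    p[1] = "^[[:alpha:]][[:alnum:]]*$"     # word starting with a letter
    p[2] = "^[0-9]+[0-9,]*[.][0-9]{2}$"    # decimal, commas optional
}
{
    i = 0                                  # number of fields kept so far
    for (j = 1; j <= NF; ++j) {
        for (k = 0; k in p; ++k) {
            # First hit for pattern k: move field j down to slot i
            # (unless it is already there).
            if ($j ~ p[k] && !q[k]++ && j > ++i) {
                $i = $j
            }
        }
    }
    q[0] = q[1] = q[2] = 0                 # reset the "seen" counters
    NF = i                                 # drop the remaining fields
    print
}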
Input:
12:23 23:23 ASDFGH 1,232.00 22.00
21:22 12:12 ASDSDS 22.00 21.00
12:21 11111 11:11 ASADSS 11.00 11.00
22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00
Output:
12:23 ASDFGH 1,232.00
21:22 ASDSDS 22.00
12:21 ASADSS 11.00
22:22 BASDASD 1,231.00
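Since the question also asks about moving to Perl/Python, here is a minimal Python sketch of the same field-by-field idea, for comparison only. It is an illustration, not part of the awk solution: the names `PATTERNS` and `first_matches` are made up for this example, and the character classes are ASCII-only, whereas awk's `[[:alpha:]]` and `[[:alnum:]]` follow the locale.

```python
import re

# Patterns adapted from the awk script; each must match a whole field.
PATTERNS = [
    re.compile(r"[0-9]+:[0-9]{2}"),          # time, e.g. 12:23
    re.compile(r"[A-Za-z][A-Za-z0-9]*"),     # word starting with a letter (ASCII only)
    re.compile(r"[0-9]+[0-9,]*\.[0-9]{2}"),  # decimal, commas optional
]


def first_matches(line):
    """Return the first field matching each pattern, in input order."""
    seen = set()  # indexes of patterns that already produced a match
    kept = []
    for field in line.split():
        for index, pattern in enumerate(PATTERNS):
            if index not in seen and pattern.fullmatch(field):
                seen.add(index)
                kept.append(field)
                break  # these patterns cannot match the same field twice
    return " ".join(kept)


if __name__ == "__main__":
    samples = [
        "12:23 23:23 ASDFGH 1,232.00 22.00",
        "21:22 12:12 ASDSDS 22.00 21.00",
        "12:21 11111 11:11 ASADSS 11.00 11.00",
        "22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00",
    ]
    for sample in samples:
        print(first_matches(sample))
```

Run on the four input lines above, this should print the same four output lines as the awk script. Keeping the patterns in a list means a fourth pattern can be added without touching the loop.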
Upvotes: 3
Reputation: 31
A Perl-style regex should solve the problem:
(\d\d:\d\d).*?([a-zA-Z]+).*?((?:\d,\d{3}\.\d\d)|(?:\d\d\.\d\d))
It will capture the following data (processing each line you provided separately):
RESULT$VAR1 = [
'12:23',
'ASDFGH',
'1,232.00'
];
RESULT$VAR1 = [
'21:22',
'ASDSDS',
'22.00'
];
RESULT$VAR1 = [
'12:21',
'ASADSS',
'11.00'
];
RESULT$VAR1 = [
'22:22',
'BASDASD',
'1,231.00'
];
Example Perl script, script.pl:
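#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;    # only needed if you dump @tab, e.g. print Dumper(\@tab)

# Take the input file name from the command line and fail loudly if it is missing.
my $file = shift @ARGV or die "Usage: $0 FILE\n";
open my $F, '<', $file or die "Cannot open $file: $!\n";

my $qr = qr/(\d\d:\d\d).*?([a-zA-Z]+).*?((?:\d,\d{3}\.\d\d)|(?:\d\d\.\d\d))/;

while (my $string = <$F>) {
    chomp $string;
    next unless length $string;      # skip empty lines
    my @tab = $string =~ $qr;        # the three captured groups
    print join(" ", @tab) . "\n";
}
close $F;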
Run as:
perl script.pl test_data.txt
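The same pattern can be tried from Python, whose `re` module accepts this syntax unchanged. The sketch below is only an illustration under that assumption; the names `PATTERN` and `extract` are made up for it. It also shows two things to keep in mind with this regex: a line that does not match yields nothing, and the amount alternatives are fixed-width and unanchored, so longer amounts are captured only partially.

```python
import re

# Same pattern as in the Perl script above.
PATTERN = re.compile(
    r"(\d\d:\d\d).*?([a-zA-Z]+).*?((?:\d,\d{3}\.\d\d)|(?:\d\d\.\d\d))"
)


def extract(line):
    """Return the three captured groups joined by spaces, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return " ".join(match.groups())


if __name__ == "__main__":
    print(extract("12:23 23:23 ASDFGH 1,232.00 22.00"))
    # A three-digit amount is only partially captured ("23.45"),
    # because \d\d\.\d\d can start matching in the middle of a number.
    print(extract("10:00 ABC 123.45"))
    # No time, word, or amount: the function returns None.
    print(extract("---"))
```

If amounts outside the X,XXX.XX and XX.XX shapes can occur, the third group would need a more general, boundary-anchored pattern.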
Cheers!
Upvotes: 1