sumodds

Reputation: 427

Matching multiple patterns in the same line using unix utilities

I am trying to match several patterns on each line and display only the first match of each pattern, all on the same output line. One of the patterns, the fourth field in the example below, can take either of two forms: A,BCD.EF or AB.CD. For example:

Example 1:
12:23 23:23 ASDFGH 1,232.00 22.00
21:22 12:12 ASDSDS 22.00 21.00 

The expected output would be

Expected Result 1:
12:23 ASDFGH 1,232.00
21:22 ASDSDS 22.00

I have got this far using my limited knowledge of grep and Stack Overflow.

< test_data.txt grep -one "[0-9]/[0-9][0-9]\|[0-9]*,[0-9]*.[0-9][0-9]\|[0-9]*.[0-9][0-9]" | awk -F ":" '$1 == y { sub(/[^:]:/,""); r = (r ? r OFS : "") $0; next } x { print x, r; r="" } { x=$0; y=$1; sub(/[^:]:/,"",x) } END { print x, r }'

Any ideas on how to make this simpler or cleaner, and how to achieve the complete functionality?

Update 1: A few other examples:

Example 2:
12:21 11111 11:11 ASADSS 11.00 11.00
22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00
  1. There could be more fields in some lines.
  2. The order of the fields is not necessarily preserved either. I could get around this by treating the files that have a different order separately, or by transforming them to this order somehow, so this condition can be relaxed.

Update 2: It seems my question was not clear. One way of looking at it: find the first "time" on a line, the first alphanumeric string, and the first decimal value (with or without a comma in it), and print all of them on the same output line. More generically: given an input line, print the first occurrence of pattern 1, the first occurrence of pattern 2, and the first occurrence of pattern 3 (which is itself an "or" of two patterns) on one output line. The output must be stable, i.e. it must preserve the order in which the matches appeared in the input. Sorry that the example is a little complicated; I am also trying to learn whether this is the point at which to leave Unix utilities for a full language like Perl or Python. Here are the expected results for the second set of examples.

Expected Result 2:
12:21 ASADSS 11.00
22:22 BASDASD 1,231.00
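
For comparison with the Unix-utility route, the generic description in Update 2 (first occurrence of each pattern, printed in input order) might be sketched in Python roughly as follows. The three regular expressions and the name first_matches are assumptions made for illustration, not part of the question; the lookarounds assume that every wanted value is a whole whitespace-delimited token.

```python
import re

# Assumed patterns for the three kinds of value described in Update 2.
# (?<!\S) and (?!\S) require the match to be a whole whitespace-delimited token.
PATTERNS = [
    re.compile(r"(?<!\S)\d+:\d{2}(?!\S)"),             # time, e.g. 12:23
    re.compile(r"(?<!\S)[A-Za-z][A-Za-z0-9]*(?!\S)"),  # alphanumeric word
    re.compile(r"(?<!\S)\d[\d,]*\.\d{2}(?!\S)"),       # 22.00 or 1,232.00
]


def first_matches(line, patterns=PATTERNS):
    """Return the first match of each pattern, ordered by position in the line."""
    found = []
    for pattern in patterns:
        match = pattern.search(line)  # search() returns the leftmost match only
        if match:
            found.append((match.start(), match.group()))
    found.sort()  # stable output: keep the order in which matches appear
    return [text for _, text in found]


if __name__ == "__main__":
    sample = [
        "12:23 23:23 ASDFGH 1,232.00 22.00",
        "21:22 12:12 ASDSDS 22.00 21.00 ",
        "12:21 11111 11:11 ASADSS 11.00 11.00",
        "22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00",
    ]
    for line in sample:
        print(" ".join(first_matches(line)))
```

Because the matches are sorted by start position rather than by pattern index, the sketch does not depend on the fields arriving in a fixed order, which is the relaxation mentioned in Update 1.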

Upvotes: 0

Views: 1024

Answers (2)

konsolebox

Reputation: 75488

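#!/usr/bin/awk -f

# One anchored pattern per wanted field: time, alphanumeric word, decimal amount.
# The {2} interval expressions need an awk that supports them (POSIX awk does).
BEGIN {
    p[0] = "^[0-9]+:[0-9]{2}$"
    p[1] = "^[[:alpha:]][[:alnum:]]*$"
    p[2] = "^[0-9]+[0-9,]*[.][0-9]{2}$"
}

{
    i = 0    # number of fields kept so far
    for (j = 1; j <= NF; ++j) {
        for (k = 0; k in p; ++k) {
            # Keep field j only if it is the first match for pattern k on this line;
            # q[k] counts matches seen, and kept fields are compacted to the front.
            if ($j ~ p[k] && !q[k]++ && j > ++i) {
                $i = $j
            }
        }
    }
    q[0] = q[1] = q[2] = 0    # reset the per-line match counters
    NF = i                    # drop everything after the kept fields
    print
}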

Input:

12:23 23:23 ASDFGH 1,232.00 22.00
21:22 12:12 ASDSDS 22.00 21.00 
12:21 11111 11:11 ASADSS 11.00 11.00
22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00

Output:

12:23 ASDFGH 1,232.00
21:22 ASDSDS 22.00
12:21 ASADSS 11.00
22:22 BASDASD 1,231.00

Upvotes: 3

robert.r

Reputation: 31

A Perl-style regex should solve the problem:

(\d\d:\d\d).*?([a-zA-Z]+).*?((?:\d,\d{3}\.\d\d)|(?:\d\d\.\d\d))

It will capture the following data (processing each line you provided separately):

RESULT$VAR1 = [
          '12:23',
          'ASDFGH',
          '1,232.00'
        ];
RESULT$VAR1 = [
          '21:22',
          'ASDSDS',
          '22.00'
        ];
RESULT$VAR1 = [
          '12:21',
          'ASADSS',
          '11.00'
        ];
RESULT$VAR1 = [
          '22:22',
          'BASDASD',
          '1,231.00'
        ];
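
As a quick cross-check, the same pattern can be tried outside Perl. The sketch below feeds it unchanged to Python's re module, which accepts this syntax; the helper name capture is hypothetical and only serves the illustration.

```python
import re

# The pattern from this answer, copied unchanged.
PATTERN = re.compile(
    r"(\d\d:\d\d).*?([a-zA-Z]+).*?((?:\d,\d{3}\.\d\d)|(?:\d\d\.\d\d))"
)


def capture(line):
    """Return the three captured groups as a list, or [] if the line does not match."""
    match = PATTERN.search(line)
    return list(match.groups()) if match else []


if __name__ == "__main__":
    sample = [
        "12:23 23:23 ASDFGH 1,232.00 22.00",
        "21:22 12:12 ASDSDS 22.00 21.00 ",
        "12:21 11111 11:11 ASADSS 11.00 11.00",
        "22:22 111232 22:22 BASDASD 1111 1,231.00 1,121.00",
    ]
    for line in sample:
        print(capture(line))
```

One caveat worth keeping in mind: the pattern fixes the capture order (time, then word, then amount) and the digit counts, so an amount such as 123.45 is captured only partially, as 23.45. For inputs outside the four sample lines the quantifiers may need loosening.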

Example perl script.pl:

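#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Take the input file name from the command line and fail loudly if it is unusable.
my $file = shift @ARGV or die "Usage: $0 FILE\n";
open my $F, '<', $file or die "Cannot open $file: $!\n";

my @strings = <$F>;
close $F;

# Captures: the first time, the first alphabetic word, the first decimal amount.
my $qr = qr/(\d\d:\d\d).*?([a-zA-Z]+).*?((?:\d,\d{3}\.\d\d)|(?:\d\d\.\d\d))/;

foreach my $string (@strings) {
    chomp $string;
    next if not $string;             # skip empty lines
    my @tab = $string =~ $qr;        # list of the three captured groups
    print join(" ", @tab) . "\n";
}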

Run as:

perl script.pl test_data.txt

Cheers!

Upvotes: 1
