juggy_gales
juggy_gales

Reputation: 35

Bash or Awk Script to Merge Lines Where X Number of Fields Match While Creating Ranges in Fields that Don't Match

NEED: I have a file containing data like the sample below. I need to:

  1. Merge all lines into one line when X number of fields match
  2. Create range of values in X number of fields when the values vary

In this case: Merge all lines into one where fields $1 through $6 and $8 match, create a range of the associated values in field $7

OR

Merge all lines into one where fields $1 through $7 match, create a range of the associated values in field $8*

  1. The solution has to be something native to Linux (like a bash or awk script) that doesn't require installation of additional software (e.g. datamash).

@rtx13's solution below using TCL does work (thanks again), I'm just not sure if I can install TCL in my live environment so I hope and AWK/BASH/etc. solution can also be proposed.

Original data:

HOST    FILTER  INTERFACE   SOURCE  DESTINATION PROTOCOL    SOURCE PORT DESTINATION PORT
host1   input   nic1        ip1     ip2         PROT        30000       10
host1   input   nic1        ip1     ip2         PROT        50000       10
host1   input   nic1        ip1     ip2         PROT        60000       10
host1   input   nic1        ip3     ip2         PROT        10          30000
host1   input   nic1        ip3     ip2         PROT        10          50000
host1   input   nic1        ip3     ip2         PROT        10          60000
host1   output  nic1        ip2     ip1         PROT        10          30000
host1   output  nic1        ip2     ip1         PROT        10          50000
host1   output  nic1        ip2     ip1         PROT        10          60000
host1   output  nic1        ip2     ip3         PROT        30000       10
host1   output  nic1        ip2     ip3         PROT        60000       10
host1   output  loc         ip2     ip2         PROT        10          30000
host1   output  loc         ip2     ip2         PROT        10          50000

Desired Processed Output:

host1   input   nic1        ip1     ip2         PROT        30000:60000 10
host1   input   nic1        ip3     ip2         PROT        10          30000:60000
host1   output  nic1        ip2     ip1         PROT        10          30000:60000
host1   output  nic1        ip2     ip3         PROT        30000:60000 10
host1   output  loc         ip2     ip2         PROT        10          30000:50000

Do I need a separate file that lists the ports I'd like to merge on? Since these could also fall inside a range as well? So the script could reference it to know whether it's a merge port or not?

10|EXAMPLE
22|SSH
80|HTTP
2049|NFS
*etc*

I'm trying to be as specific as I can be since my original post was deemed to need more focus; it received an answer (Thanks you @Enrico Maria De Angelis) that was close, but not quite all the way there as well as one that used datamash (Thank you @oguz ismail), which worked great but sadly I need a solution not requiring installation of additonal software: Merge all lines that are identical aside from a key field and make key field a range

Upvotes: 2

Views: 87

Answers (2)

rtx13
rtx13

Reputation: 2610

The following works on both columns. Provide the list of ports to merge on via command line.


mergecolumn2 script:

#!/usr/bin/tclsh

# read data in and store into 'output' array and 'row_key' list
set rows [lrange [split [read -nonewline stdin] "\n"] 1 end]
foreach row $rows {
    if { [lsearch -exact $argv [lindex $row 6]] >= 0 } {
        set merge_column 7
    } elseif { [lsearch -exact $argv [lindex $row 7]] >= 0 } {
        set merge_column 6
    } else {
        set merge_column -1 ;# do not merge
    }
    if { $merge_column >= 0 } {
        set value [lindex $row $merge_column]
        set row [lreplace $row $merge_column $merge_column "%s"]
    } else {
        set value ""
        set row [list {*}$row] ;# trim spaces for consistency
    }
    if { ![info exist output($row)] } {
        lappend row_key $row
    }
    lappend output($row) $value
}

# iterate over 'row_key' and generate output
foreach row $row_key {
    if { [llength $output($row)] > 1 } {
        set o [lreplace [lsort -dictionary $output($row)] 1 end-1]
    } else {
        set o $output($row)
    }
    puts [format $row [join $o :]]
}

Terminal session:

$ ./mergecolumn2 < data
host1 input nic1 ip1 ip2 PROT 30000 10
host1 input nic1 ip1 ip2 PROT 50000 10
host1 input nic1 ip1 ip2 PROT 60000 10
host1 input nic1 ip3 ip2 PROT 10 30000
host1 input nic1 ip3 ip2 PROT 10 50000
host1 input nic1 ip3 ip2 PROT 10 60000
host1 output nic1 ip2 ip1 PROT 10 30000
host1 output nic1 ip2 ip1 PROT 10 50000
host1 output nic1 ip2 ip1 PROT 10 60000
host1 output nic1 ip2 ip3 PROT 30000 10
host1 output nic1 ip2 ip3 PROT 60000 10
host1 output loc ip2 ip2 PROT 10 30000
host1 output loc ip2 ip2 PROT 10 50000
$ ./mergecolumn2 10 22 80 2049 < data
host1 input nic1 ip1 ip2 PROT 30000:60000 10
host1 input nic1 ip3 ip2 PROT 10 30000:60000
host1 output nic1 ip2 ip1 PROT 10 30000:60000
host1 output nic1 ip2 ip3 PROT 30000:60000 10
host1 output loc ip2 ip2 PROT 10 30000:50000
$ 

Upvotes: 1

rtx13
rtx13

Reputation: 2610

The following works on one column at a time.


mergecolumn script:

#!/usr/bin/tclsh
# check for column number
if { ![llength $argv] || ![regexp {^\d+$} [lindex $argv 0] merge_column] } {
    puts stderr "usage: $argv0 column (0-based)"
    exit 1
}

# read data in and store into 'output' array and 'row_key' list
set rows [lrange [split [read -nonewline stdin] "\n"] 1 end]
foreach row $rows {
    set value [lindex $row $merge_column]
    set row [lreplace $row $merge_column $merge_column "%s"]
    if { ![info exist output($row)] } {
        lappend row_key $row
    }
    lappend output($row) $value
}

# iterate over 'row_key' and generate output
foreach row $row_key {
    if { [llength $output($row)] > 1 } {
        set o [lreplace [lsort -dictionary $output($row)] 1 end-1]
    } else {
        set o $output($row)
    }
    puts [format $row [join $o :]]
}

Terminal session:

$ cat data
HOST    FILTER  INTERFACE   SOURCE  DESTINATION PROTOCOL    SOURCE PORT DESTINATION PORT
host1   input   nic1        ip1     ip2         PROT        30000       10
host1   input   nic1        ip1     ip2         PROT        50000       10
host1   input   nic1        ip1     ip2         PROT        60000       10
host1   input   nic1        ip3     ip2         PROT        10          30000
host1   input   nic1        ip3     ip2         PROT        10          50000
host1   input   nic1        ip3     ip2         PROT        10          60000
host1   output  nic1        ip2     ip1         PROT        10          30000
host1   output  nic1        ip2     ip1         PROT        10          50000
host1   output  nic1        ip2     ip1         PROT        10          60000
host1   output  nic1        ip2     ip3         PROT        30000       10
host1   output  nic1        ip2     ip3         PROT        60000       10
host1   output  loc         ip2     ip2         PROT        10          30000
host1   output  loc         ip2     ip2         PROT        10          50000
$ ./mergecolumn 6 < data
host1 input nic1 ip1 ip2 PROT 30000:60000 10
host1 input nic1 ip3 ip2 PROT 10 30000
host1 input nic1 ip3 ip2 PROT 10 50000
host1 input nic1 ip3 ip2 PROT 10 60000
host1 output nic1 ip2 ip1 PROT 10 30000
host1 output nic1 ip2 ip1 PROT 10 50000
host1 output nic1 ip2 ip1 PROT 10 60000
host1 output nic1 ip2 ip3 PROT 30000:60000 10
host1 output loc ip2 ip2 PROT 10 30000
host1 output loc ip2 ip2 PROT 10 50000
$ ./mergecolumn 7 < data
host1 input nic1 ip1 ip2 PROT 30000 10
host1 input nic1 ip1 ip2 PROT 50000 10
host1 input nic1 ip1 ip2 PROT 60000 10
host1 input nic1 ip3 ip2 PROT 10 30000:60000
host1 output nic1 ip2 ip1 PROT 10 30000:60000
host1 output nic1 ip2 ip3 PROT 30000 10
host1 output nic1 ip2 ip3 PROT 60000 10
host1 output loc ip2 ip2 PROT 10 30000:50000
$ 

Upvotes: 1

Related Questions