Suraj Muraleedharan

Reputation: 1344

Bash: remove duplicate keys from key=value pairs while preserving order

I have two files that I need to combine into a third file. Samples are below.

File 1

xab=p11
aab=p12
aac=p23
xac=p15
yab=p16

File 2

aab=p17
xac=p25
yyc=p22

I would like to keep the key order of the first file, take the value from the second file when a key appears in both, and append the keys that appear only in the second file. The result should be:

File 3

xab=p11
aab=p17
aac=p23
xac=p25
yab=p16
yyc=p22

I tried many approaches but could not find a simple, easy-to-understand solution. The one I found on Stack Overflow works, but it is hard to understand and to explain to someone else. The solution I found was:

cat en_us.txt en_US2.txt | tr -s '\n' | awk -F= '!a[$1]{b[++i]=$1} {a[$1]=$0;} END{for(j=1;j<=i;j++){print a[b[j]]}}'

Can anyone suggest a more readable solution (preferably one that does not use awk)?

Upvotes: 1

Views: 1164

Answers (6)

dahu

Reputation: 329

Given that the original request was for something more readable than awk, here are a few Tcl solutions.

  1. A verbose version written for clarity.
    #!/usr/bin/env tclsh
    package require fileutil
    foreach file $argv {
        fileutil::foreachLine line $file {
            lassign [split $line =] key value
            dict set data $key $value
        }
    }
    dict for {key value} $data {
        puts $key=$value
    }

The only two lines that might not seem obvious are:

  • lassign which takes the list in its first argument and creates variables with the names of its remaining arguments (variable destructuring).
  • dict set which adds a new entry to a dictionary (hash map / associative array) named data with the given key/value pair. Tcl will automatically create nonexistent variables the first time they are assigned to.

Tcl dictionaries preserve the insertion order (similar to Ruby as mentioned in other answers).

  2. A slightly more idiomatic version.
    #!/usr/bin/env tclsh
    package require fileutil
    foreach file $argv {
        fileutil::foreachLine line $file {
            dict set data {*}[split $line =]
        }
    }
    dict for {k v} $data {puts $k=$v}

The confusing line here uses the splat operator {*} which expands (explodes) a list as individual arguments, thus saving us the need to create temporary variables to hold the key/value pairs.

  3. Owh, an awk-ish version.
    cat f1 f2 | owh '' 'dict set data {*}[split $0 =]' 'dict for {k v} $data {puts $k=$v}'
  • It's named owh in homage to awk, after Ousterhout (the original creator of Tcl) and Welch and Hobbs, two prolific contributors to early Tcl.
  • It takes three scripts: BEGIN, each-line, and END (without the need to name them).
  • Like awk, it uses $0 to represent the whole input line.
  4. Another Tcl Awk, Tawk.
    tawk -F = 'line {dict set data $F(1) $F(2)}; END {dict for {k v} $data {puts $k=$v}}' f1 f2
  • It represents Awk's $1 as $F(1).

And a version that preserves the whole line (useful if the formatting is harder to reproduce than the simple x=y here):

tawk -F = 'line {dict set data $F(1) $F(0)}; END {puts [join [dict values $data] \n]}' f1 f2

Where:

  • $F(0) is the whole input line.
  • The dict values command returns a list of items like xab=p11, aab=p17, ...
  • Tcl doesn't need to quote string values like the \n (newline) in the join command.

Upvotes: 1

dawg

Reputation: 103874

Since Ruby hashes maintain insertion order, you can keep a hash keyed on the first field and overwrite a key's value whenever a later line supplies a new one:

ruby -F= -ane 'BEGIN{h=Hash.new()}
        h[$F[0]]=$F[1].rstrip
        END{h.map{|l| puts l.join("=")}}' f1.txt f2.txt

Prints:

xab=p11
aab=p17
aac=p23
xac=p25
yab=p16
yyc=p22

Upvotes: 2

Sundeep

Reputation: 23667

Another awk solution:

$ awk -F'=' '{ if($1 in b) a[b[$1]]=$0;
               else{a[++i]=$0; b[$1]=i} }
             END{for(j=1;j<=i;j++) print a[j]}' f1 f2
xab=p11
aab=p17
aac=p23
xac=p25
yab=p16
yyc=p22
  • Note that both files are processed together as a single input stream here, so no NR==FNR test is needed
  • else{a[++i]=$0; b[$1]=i} runs when the first column has not been seen before
    • a[++i]=$0 saves the line content under the next numeric index
    • b[$1]=i records which numeric index belongs to this first-column value
  • if($1 in b) a[b[$1]]=$0 runs when the first column has already been seen
    • a[b[$1]]=$0 overwrites the earlier entry in place
  • END{for(j=1;j<=i;j++) print a[j]} prints the array contents in index order after all input lines have been processed
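
For readers who would rather avoid awk altogether, the same two-array idea can be sketched in plain bash. This is only a sketch, assuming bash 4 or later (for associative arrays) and input lines of the form key=value. The function name merge_kv and the variable names are made up for illustration. Bash associative arrays do not keep insertion order, which is why a separate indexed array holds the lines.

```shell
#!/usr/bin/env bash
# merge_kv FILE...  (hypothetical helper, requires bash 4+)
# Later values win; keys keep the position where they were first seen.
merge_kv() {
    local -A slot=()     # key -> index into lines (plays the role of b[$1])
    local -a lines=()    # index -> latest full line (plays the role of a[i])
    local file line key
    for file in "$@"; do
        # "|| [[ -n $line ]]" also picks up a last line without a trailing newline
        while IFS= read -r line || [[ -n $line ]]; do
            key=${line%%=*}                    # text before the first "="
            [[ -z $key ]] && continue          # skip blank lines and lines with no key
            if [[ -n ${slot[$key]+seen} ]]; then
                lines[${slot[$key]}]=$line     # known key: overwrite in place
            else
                slot[$key]=${#lines[@]}        # new key: remember its slot
                lines+=("$line")
            fi
        done < "$file"
    done
    if (( ${#lines[@]} > 0 )); then
        printf '%s\n' "${lines[@]}"
    fi
}
```

Called as merge_kv f1 f2 > f3, this should produce the same six lines as the awk command for the sample files, though it will be slower than awk on large inputs.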

With Ruby, it is easier because hashes retain insertion order by default.

$ ruby -F'=' -lane 'BEGIN{h={}}; h[$F[0]]=$_; END{puts h.values}' f1 f2
xab=p11
aab=p17
aac=p23
xac=p25
yab=p16
yyc=p22
  • BEGIN{h={}}; assigns an empty hash to the variable h
  • h[$F[0]]=$_ saves the input line under its first field
  • puts h.values prints the hash values in insertion order

You can save some space by using h[$F[0]]=$F[1] and then END{h.each_key{|k| puts "#{k}=#{h[k]}"}}

Upvotes: 10

Ed Morton

Reputation: 203665

$ cat tst.awk
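BEGIN { FS=OFS="=" }
{ key=$1; val=$2 }
# First file on the command line (file2): remember its keys, in order, and their values.
NR==FNR {
    keys[++numKeys] = key
    key2val[key] = val
    next
}
# Second file (file1): print each line in its original order, using the file2
# value when the key exists there, and mark that key as consumed.
{
    if ( key in key2val ) {
        val = key2val[key]
        delete key2val[key]
    }
    print key, val
}
# Append the file2 keys that never appeared in file1, in file2 order.
END {
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        if (key in key2val) {
            print key, key2val[key]
        }
    }
}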

$ awk -f tst.awk file2 file1
xab=p11
aab=p17
aac=p23
xac=p25
yab=p16
yyc=p22
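
The NR==FNR test used above is a common awk idiom for "still reading the first file": NR counts records across all input, while FNR restarts at 1 for each file, so the two are equal only during the first file. That is why file2 is passed first: its values are loaded into the lookup table before file1 is printed. A minimal sketch of just the idiom follows. The tag_lines helper is hypothetical and not part of the answer.

```shell
#!/usr/bin/env bash
# tag_lines FIRST OTHER...  (hypothetical helper for illustration)
# Prefixes each line with "first: " or "later: " depending on which file it came from.
tag_lines() {
    awk 'NR == FNR { print "first: " $0; next }   # true only while reading the first file
         { print "later: " $0 }' "$@"
}
```

One caveat worth keeping in mind: if the first file is empty, NR==FNR is also true while the second file is read, so the idiom quietly treats the second file as the first.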

Upvotes: 4

RavinderSingh13

Reputation: 133538

EDIT: In case you want to maintain the order of both Input_files, try the following.

awk '
BEGIN{
  FS=OFS="="
}
FNR==NR{
  if(!($1 in a)){
    e[++count]=$1
  }
  a[$1]=$2
  next
}
{
  print $1,($1 in a?a[$1]:$2)
  c[$1]
}
END{
  for(i=1;i<=count;i++){
    if(!(e[i] in c)){ print e[i],a[e[i]] }
  }
}
'  Input_file2  Input_file1


Could you please try the following, written and tested with the shown samples (this does not preserve the order of the Input_file2 lines).

awk '
BEGIN{
  FS=OFS="="
}
FNR==NR{
  a[$1]=$2
  next
}
{
  print $1,($1 in a?a[$1]:$2)
  c[$1]
}
END{
  for(i in a){
    if(!(i in c)){
      print i,a[i]
    }
  }
}
'  Input_file2  Input_file1

Upvotes: 2

anubhava

Reputation: 785256

You may use this awk command:

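awk 'BEGIN {
   FS=OFS="="
}
# first file (file2): store each value by key
FNR == NR {
   a[$1] = $2
   next
}
# second file (file1): take the file2 value if the key exists there
$1 in a {
   $2 = a[$1]
   delete a[$1]
}
# always-true pattern: print every file1 line
1
# keys found only in file2 (the order of "for in" is unspecified)
END {
   for (i in a)
      print i, a[i]
}' file2 file1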
xab=p11
aab=p17
aac=p23
xac=p25
yab=p16
yyc=p22

Upvotes: 1
