BorisRu
BorisRu

Reputation: 227

Joining columns matching on multiple fields with non-equal keys between files using Bash?

I know similar questions have been asked before (e.g. Joining multiple fields in text files on Unix) but I can't seem to find a solution for my specific problem.

My files are structured as follows (col1=ID, col2=time increment, col3=data):

head file1
14.000119    0       yes
14.000119    69      yes
14.000119    168     no
14.000119    259
14.000119    431
14.000119    888     yes
head file2
14.000119    0       no
14.000119    70      no
14.000119    169     yes
14.000119    262
14.000119    456
14.000119    525

My goal is to join these two files based on ID first and Time second. However, non-equal values of Time must also be added, and the data column (col3) must be placed in the correct column in the output file.

Desired output:

14.000119    0      yes    no    
14.000119    69     yes
14.000119    70            no
14.000119    168    no    
14.000119    169           yes
14.000119    259
14.000119    262
14.000119    431
14.000119    456
14.000119    525
14.000119    888    yes

Columns are tab seperated. I know there is a solution with either awk or join but I can't seem to get it right. The closest I've come was using

awk -F\\t '{
o1=$1;o2=$2
$1=$2="";gsub("\t","")
_[o1 FS o2]=_[o1 FS o2] FS $0
} END {
for(i in _) print i,_[i]
}' file1 file2 | sort -k1,1 -k2,2 -n

Which gave me:

14.000119       0         yes      no
14.000119       69        yes
14.000119       70        no
14.000119       168       no
14.000119       169       yes
14.000119       259
14.000119       262
14.000119       431
14.000119       456
14.000119       525
14.000119       888       yes

But as you can see, the data values only get filled in the correct column (4th for file2) if the value for the same key in file1 is non-empty.

Upvotes: 0

Views: 246

Answers (3)

tshiono
tshiono

Reputation: 22022

With awk and a little help of sort, how about:

awk -F'\t' '
    NR==FNR {a[$1"\t"$2] = $3; next}
    {a[$1"\t"$2] ? a[$1"\t"$2] = a[$1"\t"$2]"\t"$3 : a[$1"\t"$2] = "\t"$3}
    END {for (i in a) print i"\t"a[i]}
' file1 file2 | sort -k1,1n -k2,2n

Result:

14.000119       0       yes     no
14.000119       69      yes
14.000119       70              no
14.000119       168     no
14.000119       169             yes
14.000119       259
14.000119       262
14.000119       431
14.000119       456
14.000119       525
14.000119       888     yes

Upvotes: 0

Ed Morton
Ed Morton

Reputation: 203532

With GNU awk for arrays of arrays and sorted_in:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ vals[$1][$2][ARGIND] = $3 }
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (id in vals) {
        for (time in vals[id]) {
            print id, time, vals[id][time][1], vals[id][time][2]
        }
    }
}

$ awk -f tst.awk file1 file2
14.000119       0       yes     no
14.000119       69      yes
14.000119       70              no
14.000119       168     no
14.000119       169             yes
14.000119       259
14.000119       262
14.000119       431
14.000119       456
14.000119       525
14.000119       888     yes

Upvotes: 0

BorisRu
BorisRu

Reputation: 227

Solved it using join and some workarounds!

join -j1 -a 1 -a 2 -e '' -o '0,1.4,2.4' -t $'\t' 
<(<file1 awk -F\\t '{print $1"-"$2 "\t" $0}' | sort -k1,1) 
<(<file2 awk -F\\t '{print $1"-"$2 "\t" $0}' | sort -k1,1) 
| sed 's/-/\t/g' | sort -k1,1 -k2,2 -n

Explanation:
-j1: joins on the first field
-a 1 -a 2: also print unpairable lines from both files
-e '': fill in blanks with empty fields
-o 0,1.4,2.4: output the first field, and 4th of file 1 and file 2
-t $'\t': tab seperated
<(<file1 ...)print the first two columns with a '-' in between instead of a tab, to 'condense' the first two columns into one (join only works on one column)
sed 's/-/\t/g': revert the dash back to a tab
sort -k1,1 -k2,2 -n: Now that the output contains 4 columns again, sort on the first and second

Upvotes: 1

Related Questions