Reputation: 227
I know similar questions have been asked before (e.g. Joining multiple fields in text files on Unix) but I can't seem to find a solution for my specific problem.
My files are structured as follows (col1=ID, col2=time increment, col3=data):
head file1
14.000119 0 yes
14.000119 69 yes
14.000119 168 no
14.000119 259
14.000119 431
14.000119 888 yes
head file2
14.000119 0 no
14.000119 70 no
14.000119 169 yes
14.000119 262
14.000119 456
14.000119 525
My goal is to join these two files based on ID first and Time second. However, non-equal values of Time must also be added, and the data column (col3) must be placed in the correct column in the output file.
Desired output:
14.000119 0 yes no
14.000119 69 yes
14.000119 70 no
14.000119 168 no
14.000119 169 yes
14.000119 259
14.000119 262
14.000119 431
14.000119 456
14.000119 525
14.000119 888 yes
Columns are tab seperated. I know there is a solution with either awk or join but I can't seem to get it right. The closest I've come was using
awk -F\\t '{
o1=$1;o2=$2
$1=$2="";gsub("\t","")
_[o1 FS o2]=_[o1 FS o2] FS $0
} END {
for(i in _) print i,_[i]
}' file1 file2 | sort -k1,1 -k2,2 -n
Which gave me:
14.000119 0 yes no
14.000119 69 yes
14.000119 70 no
14.000119 168 no
14.000119 169 yes
14.000119 259
14.000119 262
14.000119 431
14.000119 456
14.000119 525
14.000119 888 yes
But as you can see, the data values only get filled in the correct column (4th for file2) if the value for the same key in file1 is non-empty.
Upvotes: 0
Views: 246
Reputation: 22022
With awk
and a little help of sort
, how about:
awk -F'\t' '
NR==FNR {a[$1"\t"$2] = $3; next}
{a[$1"\t"$2] ? a[$1"\t"$2] = a[$1"\t"$2]"\t"$3 : a[$1"\t"$2] = "\t"$3}
END {for (i in a) print i"\t"a[i]}
' file1 file2 | sort -k1,1n -k2,2n
Result:
14.000119 0 yes no
14.000119 69 yes
14.000119 70 no
14.000119 168 no
14.000119 169 yes
14.000119 259
14.000119 262
14.000119 431
14.000119 456
14.000119 525
14.000119 888 yes
Upvotes: 0
Reputation: 203532
With GNU awk for arrays of arrays and sorted_in:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ vals[$1][$2][ARGIND] = $3 }
END {
PROCINFO["sorted_in"] = "@ind_num_asc"
for (id in vals) {
for (time in vals[id]) {
print id, time, vals[id][time][1], vals[id][time][2]
}
}
}
$ awk -f tst.awk file1 file2
14.000119 0 yes no
14.000119 69 yes
14.000119 70 no
14.000119 168 no
14.000119 169 yes
14.000119 259
14.000119 262
14.000119 431
14.000119 456
14.000119 525
14.000119 888 yes
Upvotes: 0
Reputation: 227
Solved it using join and some workarounds!
join -j1 -a 1 -a 2 -e '' -o '0,1.4,2.4' -t $'\t'
<(<file1 awk -F\\t '{print $1"-"$2 "\t" $0}' | sort -k1,1)
<(<file2 awk -F\\t '{print $1"-"$2 "\t" $0}' | sort -k1,1)
| sed 's/-/\t/g' | sort -k1,1 -k2,2 -n
Explanation:
-j1
: joins on the first field
-a 1 -a 2
: also print unpairable lines from both files
-e ''
: fill in blanks with empty fields
-o 0,1.4,2.4
: output the first field, and 4th of file 1 and file 2
-t $'\t'
: tab seperated
<(<file1 ...)
print the first two columns with a '-' in between instead of a tab, to 'condense' the first two columns into one (join only works on one column)
sed 's/-/\t/g'
: revert the dash back to a tab
sort -k1,1 -k2,2 -n
: Now that the output contains 4 columns again, sort on the first and second
Upvotes: 1