Madza Farias-Virgens
Madza Farias-Virgens

Reputation: 1071

Merge files print 0 in empty field

I have 5 tab delim files file 0 is basically a key

A 
C 
F 
AA 
BC 
CC 
D 
KKK 
S

file1

A 2 
C 3 
F 5 
AA 5 
BC 4 
D 7

file2

A 2
C 3
F 7
D 10

file3

A 2
C 2
F 5
CC 4
D 7

file4

A 1
C 3
F 5
CC 4
D 7
KKK 10

I would like to merge all files based on the 1st column and print 0 in missing fields.

A 2 2 2 1
C 3 3 2 3
F 5 7 5 5
AA 5 0 0 0
BC 4 0 0 0
CC 0 0 4 4
D 7 10 7 7
KKK 0 0 0 10
S 0 0 0 0

Columns must keep the order of input file0, file1, file2, file3, file4

Upvotes: 1

Views: 234

Answers (5)

Thor
Thor

Reputation: 47189

You could use coreutils join to determine missing fields and add them to each file:

sort file0 > file0.sorted
for file in file[1-4]; do 
  { 
    cat $file
    join -j 1 -v 1 file0.sorted <(sort $file) | sed 's/$/ 0/'
  } | sort > $file.sorted
done

Now you just need to paste them together:

paste file0.sorted \
  <(cut -d' ' -f2 file1.sorted) \
  <(cut -d' ' -f2 file2.sorted) \
  <(cut -d' ' -f2 file3.sorted) \
  <(cut -d' ' -f2 file4.sorted)

Output:

A   2   2   2   1
AA  5   0   0   0
BC  4   0   0   0
C   3   3   2   3
CC  0   0   4   4
D   7   10  7   7
F   5   7   5   5
KKK 0   0   0   10
S   0   0   0   0

Upvotes: 1

Thor
Thor

Reputation: 47189

With GNU awk you can use the ENDFILE clause to make sure you have enough elements in all rows, e.g.:

parse.awk

BEGIN { OFS = "\t" }

# Collect all information into the `h` hash
{ h[$1] = (ARGIND == 1 ? $1 : h[$1] OFS $2) }

# At the end of each file do the necessary padding
ENDFILE {
  for(k in h) {
    elems = split(h[k], a, OFS)
    if (elems !=  ARGIND)
      h[k] = h[k] OFS 0
  }
}

# Print the content of `h`
END {
  for(k in h)
    print h[k]
}

Run it like this:

awk -f parse.awk file[0-4]

Output:

AA  5   0   0   0
A   2   2   2   1
C   3   3   2   3
D   7   10  7   7
BC  4   0   0   0
CC  0   0   4   4
S   0   0   0   0
KKK 0   0   0   10
F   5   7   5   5

NB: This solution assumes you only have two columns per file (except the first one).

Upvotes: 1

Ed Morton
Ed Morton

Reputation: 204310

I was going to wait til you included your own attempt in your question but since you have 2 answers already anyway....

$ cat tst.awk
NR==FNR {
    key2rowNr[$1] = ++numRows
    rowNr2key[numRows] = $1
    next
}
FNR==1 { ++numCols }
{
    rowNr = key2rowNr[$1]
    vals[rowNr,numCols] = $2
}
END {
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        printf "%s", rowNr2key[rowNr]
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%d", OFS, vals[rowNr,colNr]
        }
        print ""
    }
}

$ awk -f tst.awk file0 file1 file2 file3 file4
A 2 2 2 1
C 3 3 2 3
F 5 7 5 5
AA 5 0 0 0
BC 4 0 0 0
CC 0 0 4 4
D 7 10 7 7
KKK 0 0 0 10
S 0 0 0 0

Upvotes: 2

Shawn
Shawn

Reputation: 52569

Not awk, but this sort of joining of files on a common field is exactly what join is meant for. Complicated a bit by it only working with two files at a time; you have to pipe the results of each one into the next as the first file.

$ join -o 0,2.2 -e0 -a1 <(sort file0) <(sort file1) \
  | join -o 0,1.2,2.2 -e0 -a1 - <(sort file2) \
  | join -o 0,1.2,1.3,2.2 -e0 -a1 - <(sort file3) \
  | join -o 0,1.2,1.3,1.4,2.2 -e0 -a1 - <(sort file4) \
  | tr ' ' '\t'
A       2       2       2       1
AA      5       0       0       0
BC      4       0       0       0
C       3       3       2       3
CC      0       0       4       4
D       7       10      7       7
F       5       7       5       5
KKK     0       0       0       10
S       0       0       0       0

Caveats: This requires a shell like bash or zsh that understands <(command) redirection. Sorting all the files in advance is an alternative. Or as pointed out, even though join normally requires its input files to be sorted on the column that's being joined on, it works anyways without the sorts for this particular input.

Upvotes: 2

Baba
Baba

Reputation: 850

awk solution

awk '
    FNR==1{f++}
    {
        a[f""$1]=$2
        b[$1]++
    }
    END{
        for(i in b){
            printf i" "
            for(j=1;j<=f;j++){
                tmp=j""i
                if(tmp in a){
                    printf a[tmp]" "
                }else{
                    printf 0" "
                }
            }
            print ""
        }
    }
' file*

oupput :

A  2 2 2 1 
AA  5 0 0 0 
BC  4 0 0 0 
C  3 3 2 3 
CC  0 0 4 4 
D  7 10 7 7 
F  5 7 5 5 
KKK  0 0 0 10 
S  0 0 0 0

first i store every value for each file number and key value in variable a then store all uniqe key in variable b

and in END block, checked if key is exists or not, if exists print it OR not exist print 0

we can delete file0, if delete it, awk show only exists key in file1,2,3,4,..

Upvotes: 2

Related Questions