bongboy

Reputation: 157

Awk to compare all the files in a directory and display the frequency of occurrence

Suppose a directory contains 3 files, File 1, File 2 and File 3, all with the same header. Is it possible in awk to compare them and write out how often each cell differs between consecutive files?

File 1

C1  C2  C3  C4
a   d   a   d
a   d   a   d
a   d   a   d



File 2

C1  C2  C3  C4
a   d   a   d
a   v   a   d
a   d   a   d

File 3 

C1  C2  C3  C4
a   d   r   d
a   f   a   d
a   d   a   d

Step 1: compare File 1 and File 2, writing 1 wherever a cell differs.

Temp.output

C1  C2  C3  C4
0   0   0   0
0   1   0   0
0   0   0   0

Then compare File 2 and File 3, and overwrite Temp.output with the cumulative frequency:

Final.Output
C1  C2  C3  C4
0   0   1   0
0   2   0   0
0   0   0   0

The original directory may contain many files, and I want them processed in order, i.e. File1.txt with File2.txt, then File2.txt with File3.txt, and so on.

Upvotes: 0

Views: 130

Answers (2)

Ramón Gil Moreno

Reputation: 829

I suggest converting each input file into a single line. After that, awk can be applied easily.

The paste -s <file> command is your ally. Below you can see how to list your files in sorted order and convert each one to a single line:

$ cat File1.txt 
C1  C2  C3  C4
a   d   a   d
a   d   a   d
a   d   a   d
$ ls
File1.txt  File2.txt  File3.txt
$ ls | sort
File1.txt
File2.txt
File3.txt
$ ls | sort | xargs -L 1 -I {} /bin/bash -c 'echo -n {}" "; paste -s {}'
File1.txt C1  C2  C3  C4    a   d   a   d   a   d   a   d   a   d   a   d
File2.txt C1  C2  C3  C4    a   d   a   d   a   v   a   d   a   d   a   d
File3.txt C1  C2  C3  C4    a   d   r   d   a   f   a   d   a   d   a   d
$ 
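
If the xargs and bash -c combination feels heavy, a plain shell loop should produce the same kind of flattened lines. This is only a sketch: the scratch directory and the two sample files it creates are assumptions for demonstration, and note that paste -s joins the lines of each file with tabs by default.

```shell
#!/bin/bash
# Demo setup (assumption): recreate two of the sample files in a scratch directory.
dir=$(mktemp -d)
printf 'C1 C2 C3 C4\na d a d\na d a d\na d a d\n' > "$dir/File1.txt"
printf 'C1 C2 C3 C4\na d a d\na v a d\na d a d\n' > "$dir/File2.txt"

# The glob expands in sorted order; each file becomes "name <all its lines joined>".
flat=$(for f in "$dir"/File*.txt; do
    printf '%s ' "${f##*/}"   # filename without the directory part
    paste -s "$f"             # join all lines of the file into one
done)

echo "$flat"
rm -rf "$dir"
```

Because awk splits on any run of blanks or tabs by default, the tab separators inserted by paste do not affect the field counting in the later steps.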

Once each file is on one line, you can use awk to iterate over the fields (NF tells you how many there are). We will use several rules.

For every line, compare the field at index i with the previously saved value and increment the result when they differ. The (NR != 1) selector skips the comparison for the first line, which has no predecessor.

(NR != 1) { for (i = 1; i <= NF; i++) { if (last[i] != $i) { result[i]++; } } }

In the same awk call, include the rule that updates the array where you keep the last values:

{ for (i = 1; i <= NF; i++) { last[i] = $i  }  }

Finally, print the filename and the current state of the results:

{ printf("%s", $1); for (i = 1; i <= NF; i++) { printf(" %d", result[i]) } print "" }

Here is the whole command:

$ ls | sort | xargs -L 1 -I {} /bin/bash -c 'echo -n {}" "; paste -s {}' | awk '(NR != 1) { for (i = 1; i <= NF; i++) { if (last[i] != $i) { result[i]++; } } } { for (i = 1; i <= NF; i++) { last[i] = $i  }  } { printf("%s", $1); for (i = 1; i <= NF; i++) { printf(" %d", result[i]) } print "" }'
File1.txt 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
File2.txt 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
File3.txt 2 0 0 0 0 0 0 1 0 0 2 0 0 0 0 0 0
$ 
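
For readability, the three rules can also be laid out as a commented multi-line awk program. The sketch below runs them on a shortened toy input (two flattened lines with two header fields and two data fields, an assumption made to keep the example small) rather than on the real files.

```shell
#!/bin/bash
# Two flattened lines standing in for the xargs/paste output (toy input, assumption).
input='File1.txt C1 C2 a d
File2.txt C1 C2 a v'

out=$(printf '%s\n' "$input" | awk '
# Rule 1: from the second line on, count fields that differ from the previous line.
NR != 1 {
    for (i = 1; i <= NF; i++)
        if (last[i] != $i) result[i]++
}
# Rule 2: remember the current line for the next comparison.
{
    for (i = 1; i <= NF; i++) last[i] = $i
}
# Rule 3: print the filename followed by the running counts.
{
    printf "%s", $1
    for (i = 1; i <= NF; i++) printf " %d", result[i]
    print ""
}')

echo "$out"
```

With this toy input the second line should show a 1 for the filename field and a 1 for the last data field, since those are the only two fields that changed.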

Each output line starts with the filename, followed by the accumulated differences in:

  • The filename (always different)
  • The column headers (C1, C2, C3, C4; always the same)
  • Then your 12 values. Your relevant data starts at field 7.

You can format that back into rows with a second awk call, inserting newlines where appropriate:

awk '{ print ""; printf("%s", $1); for (i = 7; i <= NF; i++) { if (((i - 7) % 4) == 0) print "" ; printf(" %d", $i) } print "" }'

Here is a complete run:

$ ls | sort | xargs -L 1 -I {} /bin/bash -c 'echo -n {}" "; paste -s {}' | awk '(NR != 1) { for (i = 1; i <= NF; i++) { if (last[i] != $i) { result[i]++; } } } { for (i = 1; i <= NF; i++) { last[i] = $i  }  } { printf("%s", $1); for (i = 1; i <= NF; i++) { printf(" %d", result[i]) } print "" }' | awk '{ print ""; printf("%s", $1); for (i = 7; i <= NF; i++) { if (((i - 7) % 4) == 0) print "" ; printf(" %d", $i) } print "" }'

File1.txt
 0 0 0 0
 0 0 0 0
 0 0 0 0

File2.txt
 0 0 0 0
 0 1 0 0
 0 0 0 0

File3.txt
 0 0 1 0
 0 2 0 0
 0 0 0 0
$ 
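
As a variation on the pipeline above, the same cumulative counts can probably be computed in a single awk call that reads the files directly and uses FNR to detect the start of each file. This is a minimal sketch, not the method used in this answer: it assumes every file has exactly one header line, the same number of rows and columns, and whitespace-separated cells. The scratch directory and the file contents are recreated only for demonstration.

```shell
#!/bin/bash
# Demo setup (assumption): recreate the three sample files in a scratch directory.
dir=$(mktemp -d)
printf 'C1 C2 C3 C4\na d a d\na d a d\na d a d\n' > "$dir/File1.txt"
printf 'C1 C2 C3 C4\na d a d\na v a d\na d a d\n' > "$dir/File2.txt"
printf 'C1 C2 C3 C4\na d r d\na f a d\na d a d\n' > "$dir/File3.txt"

# The shell expands the glob in sorted order, so files are compared pairwise:
# File1 with File2, then File2 with File3.
result=$(awk '
FNR == 1 { nfiles++; header = $0; next }   # header line: count the file, skip it
{
    for (i = 1; i <= NF; i++) {
        # From the second file on, count a change against the previous file.
        if (nfiles > 1 && last[FNR, i] != $i) count[FNR, i]++
        last[FNR, i] = $i
    }
    rows = FNR; cols = NF
}
END {
    print header
    for (r = 2; r <= rows; r++) {
        for (c = 1; c <= cols; c++)
            printf "%s%d", (c > 1 ? " " : ""), count[r, c]
        print ""
    }
}' "$dir"/File*.txt)

echo "$result"
rm -rf "$dir"
```

One caveat on ordering: glob expansion sorts names lexically, so File10.txt would come before File2.txt. Zero-padded names (File02.txt) avoid that.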

Upvotes: 1

Vishal

Reputation: 283

Please find the awk script below. Note that row=4 counts the header line as well as the three data rows.

#!/bin/bash
/usr/bin/awk '{print $0;}' /tmp/file* | awk -v row=4 -v col=4 '
{
    x = (NR - 1)%row;
    for(i = 1; i <= NF; i++){
        if(a[x, i] != $i){
            a[x, i] = $i;
            count[x, i]++;
        }
    }
}END{
    for(i = 1; i <= row-1; i++){
        for(j = 1; j <= col; j++){
            printf (count[i, j]-1)" ";
        }
        printf "\n";
    }
}'
#
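
Rather than hard-coding row and col, the two values could be read from the first file before the main awk call. The sketch below is one way to do that under the assumption that all files share the shape of the first one; the scratch directory and the two sample files are made up for the demonstration, and cat replaces the pass-through awk '{print $0;}' call.

```shell
#!/bin/bash
# Demo setup (assumption): two sample files in a scratch directory.
dir=$(mktemp -d)
printf 'C1 C2 C3 C4\na d a d\na d a d\na d a d\n' > "$dir/file1"
printf 'C1 C2 C3 C4\na d a d\na v a d\na d a d\n' > "$dir/file2"

# Take the shape from the first file: row counts the header line too.
set -- "$dir"/file*
row=$(awk 'END { print NR }' "$1")
col=$(awk 'NR == 1 { print NF; exit }' "$1")
echo "row=$row col=$col"

counts=$(cat "$dir"/file* | awk -v row="$row" -v col="$col" '
{
    x = (NR - 1) % row                  # x == 0 is the header line of each file
    for (i = 1; i <= NF; i++)
        if (a[x, i] != $i) { a[x, i] = $i; count[x, i]++ }
}
END {
    # The first file always sets every cell once, hence the -1.
    for (i = 1; i <= row - 1; i++) {
        for (j = 1; j <= col; j++) printf "%d ", count[i, j] - 1
        printf "\n"
    }
}')

echo "$counts"
rm -rf "$dir"
```

This relies on the files containing no blank or extra lines, since the modulo on NR is what maps each input line to its row.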

The script below also prints the counts at each iteration:

#!/bin/bash

/usr/bin/awk '{print $0;}' /tmp/stack/file* | awk -v row=4 -v col=4 '
{
    x = (NR - 1)%row;
    for(i = 1; i <= NF; i++){
        if(a[x, i] != $i){
            a[x, i] = $i;
            count[x, i]++;
        }
    }

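    # Print the running counts once a whole file has been read,
    # so that no row is reported before it has been seen.
    if (x == row - 1) {
        for(i = 1; i <= row-1; i++){
            for(j = 1; j <= col; j++){
                printf (count[i, j]-1)" ";
            }
            printf "\n";
        }
        print "***********";
    }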

}END{
    for(i = 1; i <= row-1; i++){
        for(j = 1; j <= col; j++){
            printf (count[i, j]-1)" ";
        }
        printf "\n";
    }
}'

Upvotes: 0
