calculating the mean of columns in text files

Question

I have two folders named f1 and f2. Each folder contains 300 text files with 2 columns. The contents of the files are shown below. I would like to calculate the mean of the second column of each file. The file names are the same in both folders.

file1 in f1 folder
54  6
55 10
57 5

file2 in f1 folder
24  8
28  12

file1 in f2 folder
34 3
22 8

file2 in f2 folder
24  8
28  13

output

                   folder1       folder2

     file1         21/3= 7       11/2=5.5
     file2         20/2=10       21/2=10.5

      --            --             --    

      --            --             --

    file300         --             --

              total mean of folder1 = sum of the means/300
              total mean of folder2 = sum of the means/300

Jonathan Leffler · Accepted Answer

I'd do it with two awk scripts. (Originally, I had a sort phase in the middle, but that isn't actually necessary. However, I think that two scripts is probably easier than trying to combine them into one. If someone else does it 'all in one' and it is comprehensible, then choose their solution instead.)

Sample run and output

This is based on the 4 files shown in the question. The names of the files are listed on the command line, but the order doesn't matter. The code assumes that there is only one slash in the file names, and no spaces and the like in the file names.

$ awk -f summary1.awk f?/* | awk -f summary2.awk
file1                      21/3   =      7.000       11/2   =      5.500
file2                      20/2   =     10.000       21/2   =     10.500
total mean of f1     =     17/2   =      8.500
total mean of f2     =     16/2   =      8.000

summary1.awk

function print_data(file, sum, count) {
    sub("/", " ", file);
    print file, sum, count;
}
oldfile != FILENAME { if (count > 0) { print_data(oldfile, sum, count); }
                      count = 0; sum = 0; oldfile = FILENAME
                    }
                    { count++; sum += $2 }
END                 { print_data(oldfile, sum, count) }

This processes each file in turn, summing the values in column 2 and counting the number of lines. It prints out the folder name, the file name, the sum and the count.
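
To see what the second script receives, the first stage can be run on its own. The following is only a sketch: it recreates the four sample files from the question in a scratch directory (the `mktemp` directory and the `stage1` variable are assumptions made for illustration) and inlines the same logic as summary1.awk so that it is self-contained.

```shell
# Recreate the question's sample data in a scratch directory (illustrative setup).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir f1 f2
printf '54 6\n55 10\n57 5\n' > f1/file1
printf '24 8\n28 12\n'       > f1/file2
printf '34 3\n22 8\n'        > f2/file1
printf '24 8\n28 13\n'       > f2/file2

# Same logic as summary1.awk, inlined here.
stage1=$(awk '
function print_data(file, sum, count) {
    sub("/", " ", file)          # "f1/file1" becomes "f1 file1"
    print file, sum, count
}
oldfile != FILENAME {
    if (count > 0) print_data(oldfile, sum, count)
    count = 0; sum = 0; oldfile = FILENAME
}
{ count++; sum += $2 }
END { print_data(oldfile, sum, count) }
' f?/*)

# One record per input file: folder, file, sum of column 2, line count.
printf '%s\n' "$stage1"
```

For the sample data the first record is `f1 file1 21 3`, which matches the 21/3 shown for file1 in the report.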

summary2.awk

    {
        sum[$2,$1] = $3
        cnt[$2,$1] = $4
        if (file[$2]++ == 0) file_list[n1++] = $2
        if (fold[$1]++ == 0) fold_list[n2++] = $1
    }
END {   for (i = 0; i < n1; i++)
        { 
            printf("%-20s", file_list[i])
            name = file_list[i]
            for (j = 0; j < n2; j++)
            {
                folder = fold_list[j]
                s = sum[name,folder]
                n = cnt[name,folder]
                a = (s + 0.0) / n
                printf("   %6d/%-3d = %10.3f", s, n, a)
                gsum[folder] += a
            }
            printf("\n")
        }
        for (i = 0; i < n2; i++)
        {
            folder = fold_list[i]
            s = gsum[folder]
            n = n1;
            a = (s + 0.0) / n
            printf("total mean of %-6s = %6d/%-3d = %10.3f\n", folder, s, n, a)
        }
    }

The file associative array tracks references to file names. The file_list array keeps the file names in the order that they're read. Similarly, the fold associative array tracks the folder names, and the fold_list array keeps track of the folder names in the order that they appear. If you do something weird enough with the order that you supply the names to the first command, you may need to insert a sort command between the two awk commands, such as sort -k2,2 -k1,1.
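
As a small illustration of what that sort step does, the sketch below feeds it four intermediate records in a deliberately scrambled order. The records are the sample values from the question; the `scrambled` and `sorted` variable names are only for this example.

```shell
# Intermediate records (folder, file, sum, count) in a scrambled order.
scrambled='f2 file2 21 2
f1 file1 21 3
f2 file1 11 2
f1 file2 20 2'

# Order by file name (field 2) first, then by folder name (field 1).
sorted=$(printf '%s\n' "$scrambled" | sort -k2,2 -k1,1)
printf '%s\n' "$sorted"
```

After sorting, both folders' records for file1 come before those for file2, with f1 ahead of f2 each time, so the report rows and columns come out in a predictable order. Note that this is a plain lexical sort, so a name like file10 would sort before file2.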

The sum associative array contains the sum for a given file name and folder name. The cnt associative array contains the count for a given file name and folder name.

The END section of the report has two main loops (though the first loop contains a nested loop). The first main loop processes the files in the order presented, generating one line containing one entry for each folder. It also accumulates the averages for each folder. The second main loop generates the 'total mean' data for each folder. I'm not sure whether the statistic makes sense (shouldn't the overall mean for folder1 be the sum of the values in folder1 divided by the number of entries, or 41/5 = 8.2 rather than 17/2 or 8.5?), but the calculation does what I think the question asks for (sum of means / number of files, written as 300 in the question).
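
If the pooled figure (sum of all the values in a folder divided by the number of rows) turns out to be what is wanted, it can be computed in a single pass. This is a minimal sketch rather than part of the two scripts above; it assumes the same one-slash `folder/file` naming, and the scratch directory and the `pooled` variable are there only to make it self-contained.

```shell
# Recreate the question's sample data in a scratch directory (illustrative setup).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir f1 f2
printf '54 6\n55 10\n57 5\n' > f1/file1
printf '24 8\n28 12\n'       > f1/file2
printf '34 3\n22 8\n'        > f2/file1
printf '24 8\n28 13\n'       > f2/file2

# Pooled mean of column 2 per folder: total of all values / total row count.
pooled=$(awk '
{
    split(FILENAME, part, "/")                   # part[1] is the folder name
    if (!(part[1] in sum)) order[n++] = part[1]  # remember first-seen order
    sum[part[1]] += $2
    cnt[part[1]]++
}
END {
    for (i = 0; i < n; i++) {
        f = order[i]
        printf("%s %d/%d = %.3f\n", f, sum[f], cnt[f], sum[f] / cnt[f])
    }
}
' f?/*)

printf '%s\n' "$pooled"
```

For the sample data this gives 41/5 = 8.200 for f1, as opposed to the 8.500 produced by averaging the per-file means, because file1 contributes three rows and file2 only two.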
