user3065582

Reputation: 13

Same column of different files into the same new file

I have multiple folders Case-1, Case-2, ..., Case-N, each containing a file named PPD. I want to extract the 2nd column of every PPD and put them side by side into one file named 123.dat. It seems that I cannot use awk in a for loop.

case=$1
for (( i = 1; i <= $case ; i++ ))
do
    file=Case-$i
    cp $file/PPD temp$i.dat

    awk 'FNR==1{f++}{a[f,FNR]=$2}
         END{for(x=1;x<=FNR;x++)
             {for(y=1;y<ARGC;y++)
             printf("%s ",a[y,x]);print ""}}' temp$i.dat >> 123.dat
done

Now 123.dat only has the data of the last PPD, from Case-N.

I know I could use join (I have used that command before) if every PPD file shared a common column, but it turns out to be extremely slow when I have lots of Case folders.

Upvotes: 0

Views: 147

Answers (3)

Kannan Mohan

Reputation: 1840

The AWK program below can help you.

#!/usr/bin/awk -f
# Note: BEGINFILE and ERRNO are GNU awk (gawk) extensions.

BEGIN {
    # Defaults
    nrecord=1
    nfiles=0
}

BEGINFILE {
    # Check if the input file is accessible,
    # if not skip the file and print error.
    if (ERRNO != "") {
        print("Error: ",FILENAME, ERRNO)
        nextfile
    }
}

{
    # Check if the file is accessed for the first time
    # if so then increment nfiles. This is to keep count of
    # number of files processed.
    if ( FNR == 1 ) {
        nfiles++
    } else if (FNR > nrecord) {
        # Track the maximum number of records seen in any file so far.
        nrecord=FNR
    }

    # Fetch the second column from the file.
    array[nfiles,FNR]=$2

}

END {
    # Iterate through the array and print the records.
    for (i=1; i<=nrecord; i++) {
        for (j=1; j<=nfiles; j++) {
            printf("%5s", array[j,i])
        }
        print ""
    }
}

Output:

$ ./get.awk Case-*/PPD
    1   11   21
    2   12   22
    3   13   23
    4   14   24
    5   15   25
    6   16   26
    7   17   27
    8   18   28
    9   19   29
   10   20   30

Here Case-*/PPD expands to Case-1/PPD, Case-2/PPD, Case-3/PPD and so on. Below are the source files from which the output was generated.

$ cat Case-1/PPD 
1   1   1   1
2   2   2   2
3   3   3   3
4   4   4   4
5   5   5   5
6   6   6   6
7   7   7   7
8   8   8   8
9   9   9   9
10  10  10  10
$ cat Case-2/PPD 
11  11  11  11
12  12  12  12
13  13  13  13
14  14  14  14
15  15  15  15
16  16  16  16
17  17  17  17
18  18  18  18
19  19  19  19
20  20  20  20
$ cat Case-3/PPD 
21  21  21  21
22  22  22  22
23  23  23  23
24  24  24  24
25  25  25  25
26  26  26  26
27  27  27  27
28  28  28  28
29  29  29  29
30  30  30  30
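BEGINFILE and ERRNO above are GNU awk extensions, so the script needs gawk. If only a POSIX awk (e.g. mawk or BSD awk) is available, a stripped-down sketch of the same idea, minus the per-file error check, could look like this. The sample files created here are made up purely for illustration:

```shell
# Portable POSIX-awk sketch of the same column-gathering idea.
# First create three tiny hypothetical input files.
mkdir -p Case-1 Case-2 Case-3
printf '1 1\n2 2\n'   > Case-1/PPD
printf '1 11\n2 12\n' > Case-2/PPD
printf '1 21\n2 22\n' > Case-3/PPD

awk '
FNR == 1      { nfiles++ }            # a new input file has started
FNR > nrecord { nrecord = FNR }       # track the longest file seen so far
              { a[nfiles, FNR] = $2 } # store the second column
END {
    for (i = 1; i <= nrecord; i++) {
        for (j = 1; j <= nfiles; j++)
            printf("%5s", a[j, i])
        print ""
    }
}' Case-*/PPD > 123.dat
```

This keeps the same two-dimensional indexing and `%5s` output format as the gawk version; it simply drops the unreadable-file diagnostic.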

Upvotes: 2

danfuzz

Reputation: 4353

The interaction between the outer shell script and the inner awk invocation isn't working the way you expect.

Every time through the loop, the shell script starts a new awk process, which means that f is unset, and the first clause then sets it to 1. It never becomes 2. That is, you are starting a fresh awk process on each iteration of the outer loop, and awk starts from scratch each time.

There are other ways to structure your code, but as a minimal tweak, you can pass in the number $i to the awk invocation using the -v option, e.g. awk -v i="$i" ....
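A minimal sketch of that tweak (the sample directories, their contents, and the temp-file names here are all made up for illustration): each iteration uses -v to hand $i to awk, which writes that case's column to its own temp file, and a single paste at the end joins them:

```shell
#!/bin/sh
# Sketch: pass the loop counter into awk with -v, one column file per case.
mkdir -p Case-1 Case-2                 # hypothetical sample data
printf '1 10\n2 20\n' > Case-1/PPD
printf '1 30\n2 40\n' > Case-2/PPD

case=2
for i in $(seq 1 "$case"); do
    # -v makes the shell's $i visible inside awk as the awk variable i
    awk -v i="$i" '{ print $2 > ("col" i ".tmp") }' "Case-$i/PPD"
done
paste col*.tmp > 123.dat               # tab-separated, one column per case
rm -f col*.tmp
```

Because each awk run only ever touches its own col$i.tmp, the fresh-process-per-iteration behavior stops being a problem.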

Note that there are better ways to structure your overall solution, as other answerers have already suggested; I meant this response to be an answer to the question "Why doesn't this work?" rather than "Please rewrite this code."

Upvotes: 2

tripleee

Reputation: 189357

Maybe

eval paste $(printf ' <(cut -f2 %s)' Case-*/PPD)

There is probably a limit to how many process substitutions you can perform in one go. I did this with 20 columns and it was fine. Process substitutions are a Bash feature, so not portable to other Bourne-compatible shells in general.

The wildcard is expanded in alphabetical order. If you want the cases in numerical order, use something like Case-[1-9] Case-[1-9][0-9] Case-[1-9][0-9][0-9] to force the expansion to pick up the single digits first, then the double digits, and so on.
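A quick illustration of the difference (the directory names are hypothetical, and the ordering shown assumes C-locale collation):

```shell
#!/bin/sh
# Demonstrate alphabetical vs forced-numerical glob expansion.
mkdir -p glob-demo
for i in 1 2 10 11; do mkdir -p "glob-demo/Case-$i"; done

# Plain glob, alphabetical: Case-1 Case-10 Case-11 Case-2
echo glob-demo/Case-*

# Bracketed globs, numerical: Case-1 Case-2 Case-10 Case-11
echo glob-demo/Case-[1-9] glob-demo/Case-[1-9][0-9]
```

Each bracketed pattern matches only names with exactly that many digits, so the shell emits all one-digit cases before any two-digit case.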

Upvotes: 2
