RonicK
RonicK

Reputation: 229

append columns to tab delimited file using AWK

I have multiple files without headers with same first four columns and different fifth column. I have to append first four common columns all fifth columns with respective headers as shown below into single final tab delimited text file using awk.

File_1.txt

chr1    101845021   101845132   A   0
chr2    128205033   128205154   B   0
chr3    128205112   128205223   C   0
chr4    36259133    36259244    D   0
chr5    36259333    36259444    E   0
chr6    25497759    25497870    F   1
chr7    25497819    25497930    G   1
chr8    25497869    25497980    H   1

File_2.txt

chr1    101845021   101845132   A   6
chr2    128205033   128205154   B   7
chr3    128205112   128205223   C   7
chr4    36259133    36259244    D   7
chr5    36259333    36259444    E   10
chr6    25497759    25497870    F   11
chr7    25497819    25497930    G   11
chr8    25497869    25497980    H   12

File_3.txt

chr1    101845021   101845132   A   41
chr2    128205033   128205154   B   41
chr3    128205112   128205223   C   42
chr4    36259133    36259244    D   43
chr5    36259333    36259444    E   47
chr6    25497759    25497870    F   48
chr7    25497819    25497930    G   48
chr8    25497869    25497980    H   49

Expected Output file Final.txt

Part    Start   End Name    File1   File2   File3
chr1    101845021   101845132   A   0   6   41
chr2    128205033   128205154   B   0   7   41
chr3    128205112   128205223   C   0   7   42
chr4    36259133    36259244    D   0   7   43
chr5    36259333    36259444    E   0   10  47
chr6    25497759    25497870    F   1   11  48
chr7    25497819    25497930    G   1   11  48
chr8    25497869    25497980    H   1   12  49

Upvotes: 0

Views: 701

Answers (2)

Jonathan Leffler
Jonathan Leffler

Reputation: 755114

Files in same order

If it is safe to assume that the rows are in the same order in each file, then you can do the job fairly succinctly with:

awk '
FILENAME != oname { FN++; oname = FILENAME }
    { p[FNR] = $1; s[FNR] = $2; e[FNR] = $3; n[FNR] = $4; f[FN,FNR] = $5; N = FNR }
END {
    printf("%-8s %-12s %-12s %-4s %-5s %-5s %-5s\n",
           "Part", "Start", "End", "Name", "File1", "File2", "File3");
    for (i = 1; i <= N; i++)
    {
        printf("%-8s %-12d %-12d %-4s %-5d %-5d %-5d\n",
               p[i], s[i], e[i], n[i], f[1,i], f[2,i], f[3,i]);
    }
}' file_1.txt file_2.txt file_3.txt

The first line spots when you start on a new file, and increments the FN variable (so lines from file 1 can be tagged with FN == 1, etc). It records the file name in oname so it can spot changes.

The second line operates on each data line, storing the first four fields in the arrays p, s, e, n (indexed by record number within the current file), and records the fifth column in f (indexed by FN and record number). It records the current record number in the current file in N.

The END block prints out the heading, then for each row in the array (indexed from 1 to N), prints out the various fields.

The output is (unsurprisingly):

Part     Start        End          Name File1 File2 File3
chr1     101845021    101845132    A    0     6     41   
chr2     128205033    128205154    B    0     7     41   
chr3     128205112    128205223    C    0     7     42   
chr4     36259133     36259244     D    0     7     43   
chr5     36259333     36259444     E    0     10    47   
chr6     25497759     25497870     F    1     11    48   
chr7     25497819     25497930     G    1     11    48   
chr8     25497869     25497980     H    1     12    49   

Files in different orders

If you can't rely on the records being in the same order in each file, you have to work harder. Assuming that the records in the first file are in the required order, the following script arranges to print the records in the order:

awk '
FILENAME != oname { FN++; oname = FILENAME }
    { key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
      if (FN == 1)
      {   p[key] = $1; s[key] = $2; e[key] = $3; n[key] = $4; f[FN,key] = $5; k[FNR] = key; N = FNR }
      else
      {   if (key in p)
            f[FN,key] = $5
          else
              printf "Unmatched key (%s) in %s\n", key, FILENAME
      }
    }
END {
    printf("%-8s %-12s %-12s %-4s %-5s %-5s %-5s\n",
           "Part", "Start", "End", "Name", "File1", "File2", "File3")
    for (i = 1; i <= N; i++)
    {
        key = k[i]
        printf("%-8s %-12d %-12d %-4s %-5d %-5d %-5d\n",
               p[key], s[key], e[key], n[key], f[1,key], f[2,key], f[3,key])
    }
}' "$@"

This is closely based on the previous script; the FN handling is identical. The SUBSEP variable is used to separate subscripts in a multi-index array. The variable key contains the same value that would be generated by indexing an array z[$1,$2,$3,$4].

If working on the first file (FN == 1), then the values in arrays p, s, e, n are created, indexed by key. The fifth column is recorded in f similarly. The order in which the keys appear in the file are recorded in array k, indexed by the (file) record number.

If working on the second or third file, check whether the key is known, reporting if it is not. Assuming it is known, add the fifth column in f again.

The printing is similar, except it collects the keys in sequence from k, and then prints the relevant values.

Given these files:

  • file_4.txt

    chr8    25497869    25497980    H   1
    chr7    25497819    25497930    G   1
    chr6    25497759    25497870    F   1
    chr5    36259333    36259444    E   0
    chr4    36259133    36259244    D   0
    chr3    128205112   128205223   C   0
    chr2    128205033   128205154   B   0
    chr1    101845021   101845132   A   0
    
  • file_5.txt

    chr2    128205033   128205154   B   7
    chr8    25497869    25497980    H   12
    chr3    128205112   128205223   C   7
    chr1    101845021   101845132   A   6
    chr6    25497759    25497870    F   11
    chr4    36259133    36259244    D   7
    chr7    25497819    25497930    G   11
    chr5    36259333    36259444    E   10
    
  • file_6.txt

    chr5    36259333    36259444    E   47
    chr4    36259133    36259244    D   43
    chr6    25497759    25497870    F   48
    chr8    25497869    25497980    H   49
    chr2    128205033   128205154   B   41
    chr3    128205112   128205223   C   42
    chr7    25497819    25497930    G   48
    chr1    101845021   101845132   A   41
    

The script yields the output:

Part     Start        End          Name File1 File2 File3
chr8     25497869     25497980     H    1     12    49   
chr7     25497819     25497930     G    1     11    48   
chr6     25497759     25497870     F    1     11    48   
chr5     36259333     36259444     E    0     10    47   
chr4     36259133     36259244     D    0     7     43   
chr3     128205112    128205223    C    0     7     42   
chr2     128205033    128205154    B    0     7     41   
chr1     101845021    101845132    A    0     6     41   

There are many circumstances that these scripts do not accommodate very thoroughly. For example, if the files are of different lengths; if there are repeated keys; if there are keys found in one or two files not found in the other(s); if the fifth column data is not numeric; if the second and third columns are not numeric; if there are only two files, or more than three files listed. The 'not numeric' issue is actually easily fixed; simply use %s instead of %d. But the scripts are fragile. They work in the ecosystems shown, but not very generally. The necessary fixes are not incredibly hard; they are a nuisance to have to code, though.

There could be more or less than 3 files

Extending the previous script to handle an arbitrary number of files, and to output tab-separated data instead of formatted (readable) data is not very difficult.

awk '
FILENAME != oname { FN++; file[FN] = oname = FILENAME }
    { key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
      if (FN == 1)
      {   p[key] = $1; s[key] = $2; e[key] = $3; n[key] = $4; f[FN,key] = $5; k[FNR] = key; N = FNR }
      else
      {   if (key in p)
              f[FN,key] = $5
          else
          {
              printf "Unmatched key (%s) in %s\n", key, FILENAME
              exit 1
          }

      }
    }
END {
    printf("%s\t%s\t%s\t%s", "Part", "Start", "End", "Name")
    for (i = 1; i <= FN; i++) printf("\t%s", file[i]);
    print ""
    for (i = 1; i <= N; i++)
    {
        key = k[i]
        printf("%s\t%s\t%s\t%s", p[key], s[key], e[key], n[key])
        for (j = 1; j <= FN; j++)
            printf("\t%s", f[j,key])
        print ""
    }
}' "$@"

The key point is that printf doesn't output a newline unless you tell it to do so, but print does output a newline. The code keeps a record of the actual file names for use in printing out the columns. It loops over the array of file data, assuming that there are the same number of lines in each file.

Given 6 files as input — the three original files, a copy of the first file in reverse order, and permuted copies of the second and third files, the output has 6 columns of extra data, with the columns identified:

Part    Start   End     Name    file_1.txt      file_2.txt      file_3.txt      file_4.txt      file_5.txt      file_6.txt
chr1    101845021       101845132       A       0       6       41      0       6       41
chr2    128205033       128205154       B       0       7       41      0       7       41
chr3    128205112       128205223       C       0       7       42      0       7       42
chr4    36259133        36259244        D       0       7       43      0       7       43
chr5    36259333        36259444        E       0       10      47      0       10      47
chr6    25497759        25497870        F       1       11      48      1       11      48
chr7    25497819        25497930        G       1       11      48      1       11      48
chr8    25497869        25497980        H       1       12      49      1       12      49

Upvotes: 2

oliv
oliv

Reputation: 13259

Assuming both 3 files are sorted, you can use join command:

join -o "1.1,1.2,1.3,1.4,2.5,2.6,1.5" file3 <(join -o "1.1,1.2,1.3,1.4,1.5,2.5" file1 file2)

-o option allows to format the output result with selecting certain fields from both files. 1.x and 2.x refers to the file given. For example, 1.1 refers to the first field of the first file.

Since join only accept 2 files, the bash operator <(...) is used to create a temporary file.


Another solution using paste and awk (still assuming files are sorted):

paste file* | awk '{print $1,$2,$3,$4,$5,$10,$15}'

Upvotes: 1

Related Questions