Reputation: 229
I have multiple files without headers with same first four columns and different fifth column. I have to append first four common columns all fifth columns with respective headers as shown below into single final tab delimited text file using awk.
File_1.txt
chr1 101845021 101845132 A 0
chr2 128205033 128205154 B 0
chr3 128205112 128205223 C 0
chr4 36259133 36259244 D 0
chr5 36259333 36259444 E 0
chr6 25497759 25497870 F 1
chr7 25497819 25497930 G 1
chr8 25497869 25497980 H 1
File_2.txt
chr1 101845021 101845132 A 6
chr2 128205033 128205154 B 7
chr3 128205112 128205223 C 7
chr4 36259133 36259244 D 7
chr5 36259333 36259444 E 10
chr6 25497759 25497870 F 11
chr7 25497819 25497930 G 11
chr8 25497869 25497980 H 12
File_3.txt
chr1 101845021 101845132 A 41
chr2 128205033 128205154 B 41
chr3 128205112 128205223 C 42
chr4 36259133 36259244 D 43
chr5 36259333 36259444 E 47
chr6 25497759 25497870 F 48
chr7 25497819 25497930 G 48
chr8 25497869 25497980 H 49
Expected Output file Final.txt
Part Start End Name File1 File2 File3
chr1 101845021 101845132 A 0 6 41
chr2 128205033 128205154 B 0 7 41
chr3 128205112 128205223 C 0 7 42
chr4 36259133 36259244 D 0 7 43
chr5 36259333 36259444 E 0 10 47
chr6 25497759 25497870 F 1 11 48
chr7 25497819 25497930 G 1 11 48
chr8 25497869 25497980 H 1 12 49
Upvotes: 0
Views: 701
Reputation: 755114
If it is safe to assume that the rows are in the same order in each file, then you can do the job fairly succinctly with:
awk '
FILENAME != oname { FN++; oname = FILENAME }
{ p[FNR] = $1; s[FNR] = $2; e[FNR] = $3; n[FNR] = $4; f[FN,FNR] = $5; N = FNR }
END {
printf("%-8s %-12s %-12s %-4s %-5s %-5s %-5s\n",
"Part", "Start", "End", "Name", "File1", "File2", "File3");
for (i = 1; i <= N; i++)
{
printf("%-8s %-12d %-12d %-4s %-5d %-5d %-5d\n",
p[i], s[i], e[i], n[i], f[1,i], f[2,i], f[3,i]);
}
}' file_1.txt file_2.txt file_3.txt
The first line spots when you start on a new file, and increments the FN
variable (so lines from file 1 can be tagged with FN == 1
, etc). It records the file name in oname
so it can spot changes.
The second line operates on each data line, storing the first four fields in the arrays p
, s
, e
, n
(indexed by record number within the current file), and records the fifth column in f
(indexed by FN
and record number). It records the current record number in the current file in N
.
The END
block prints out the heading, then for each row in the array (indexed from 1 to N
), prints out the various fields.
The output is (unsurprisingly):
Part Start End Name File1 File2 File3
chr1 101845021 101845132 A 0 6 41
chr2 128205033 128205154 B 0 7 41
chr3 128205112 128205223 C 0 7 42
chr4 36259133 36259244 D 0 7 43
chr5 36259333 36259444 E 0 10 47
chr6 25497759 25497870 F 1 11 48
chr7 25497819 25497930 G 1 11 48
chr8 25497869 25497980 H 1 12 49
If you can't rely on the records being in the same order in each file, you have to work harder. Assuming that the records in the first file are in the required order, the following script arranges to print the records in the order:
awk '
FILENAME != oname { FN++; oname = FILENAME }
{ key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
if (FN == 1)
{ p[key] = $1; s[key] = $2; e[key] = $3; n[key] = $4; f[FN,key] = $5; k[FNR] = key; N = FNR }
else
{ if (key in p)
f[FN,key] = $5
else
printf "Unmatched key (%s) in %s\n", key, FILENAME
}
}
END {
printf("%-8s %-12s %-12s %-4s %-5s %-5s %-5s\n",
"Part", "Start", "End", "Name", "File1", "File2", "File3")
for (i = 1; i <= N; i++)
{
key = k[i]
printf("%-8s %-12d %-12d %-4s %-5d %-5d %-5d\n",
p[key], s[key], e[key], n[key], f[1,key], f[2,key], f[3,key])
}
}' "$@"
This is closely based on the previous script; the FN
handling is identical. The SUBSEP
variable is used to separate subscripts in a multi-index array. The variable key
contains the same value that would
be generated by indexing an array z[$1,$2,$3,$4]
.
If working on the first file (FN == 1
), then the values in arrays p
, s
, e
, n
are created, indexed by key
. The fifth column is recorded in f
similarly. The order in which the keys appear in the file are recorded in array k
, indexed by the (file) record number.
If working on the second or third file, check whether the key is known, reporting if it is not. Assuming it is known, add the fifth column in f
again.
The printing is similar, except it collects the keys in sequence from k
, and then prints the relevant values.
Given these files:
file_4.txt
chr8 25497869 25497980 H 1
chr7 25497819 25497930 G 1
chr6 25497759 25497870 F 1
chr5 36259333 36259444 E 0
chr4 36259133 36259244 D 0
chr3 128205112 128205223 C 0
chr2 128205033 128205154 B 0
chr1 101845021 101845132 A 0
file_5.txt
chr2 128205033 128205154 B 7
chr8 25497869 25497980 H 12
chr3 128205112 128205223 C 7
chr1 101845021 101845132 A 6
chr6 25497759 25497870 F 11
chr4 36259133 36259244 D 7
chr7 25497819 25497930 G 11
chr5 36259333 36259444 E 10
file_6.txt
chr5 36259333 36259444 E 47
chr4 36259133 36259244 D 43
chr6 25497759 25497870 F 48
chr8 25497869 25497980 H 49
chr2 128205033 128205154 B 41
chr3 128205112 128205223 C 42
chr7 25497819 25497930 G 48
chr1 101845021 101845132 A 41
The script yields the output:
Part Start End Name File1 File2 File3
chr8 25497869 25497980 H 1 12 49
chr7 25497819 25497930 G 1 11 48
chr6 25497759 25497870 F 1 11 48
chr5 36259333 36259444 E 0 10 47
chr4 36259133 36259244 D 0 7 43
chr3 128205112 128205223 C 0 7 42
chr2 128205033 128205154 B 0 7 41
chr1 101845021 101845132 A 0 6 41
There are many circumstances that these scripts do not accommodate very thoroughly. For example, if the files are of different lengths; if there are repeated keys; if there are keys found in one or two files not found in the other(s); if the fifth column data is not numeric; if the second and third columns are not numeric; if there are only two files, or more than three files listed. The 'not numeric' issue is actually easily fixed; simply use %s
instead of %d
. But the scripts are fragile. They work in the ecosystems shown, but not very generally. The necessary fixes are not incredibly hard; they are a nuisance to have to code, though.
Extending the previous script to handle an arbitrary number of files, and to output tab-separated data instead of formatted (readable) data is not very difficult.
awk '
FILENAME != oname { FN++; file[FN] = oname = FILENAME }
{ key = $1 SUBSEP $2 SUBSEP $3 SUBSEP $4
if (FN == 1)
{ p[key] = $1; s[key] = $2; e[key] = $3; n[key] = $4; f[FN,key] = $5; k[FNR] = key; N = FNR }
else
{ if (key in p)
f[FN,key] = $5
else
{
printf "Unmatched key (%s) in %s\n", key, FILENAME
exit 1
}
}
}
END {
printf("%s\t%s\t%s\t%s", "Part", "Start", "End", "Name")
for (i = 1; i <= FN; i++) printf("\t%s", file[i]);
print ""
for (i = 1; i <= N; i++)
{
key = k[i]
printf("%s\t%s\t%s\t%s", p[key], s[key], e[key], n[key])
for (j = 1; j <= FN; j++)
printf("\t%s", f[j,key])
print ""
}
}' "$@"
The key point is that printf
doesn't output a newline unless you tell it to do so, but print
does output a newline. The code keeps a record of the actual file names for use in printing out the columns. It loops over the array of file data, assuming that there are the same number of lines in each file.
Given 6 files as input — the three original files, a copy of the first file in reverse order, and permuted copies of the second and third files, the output has 6 columns of extra data, with the columns identified:
Part Start End Name file_1.txt file_2.txt file_3.txt file_4.txt file_5.txt file_6.txt
chr1 101845021 101845132 A 0 6 41 0 6 41
chr2 128205033 128205154 B 0 7 41 0 7 41
chr3 128205112 128205223 C 0 7 42 0 7 42
chr4 36259133 36259244 D 0 7 43 0 7 43
chr5 36259333 36259444 E 0 10 47 0 10 47
chr6 25497759 25497870 F 1 11 48 1 11 48
chr7 25497819 25497930 G 1 11 48 1 11 48
chr8 25497869 25497980 H 1 12 49 1 12 49
Upvotes: 2
Reputation: 13259
Assuming both 3 files are sorted, you can use join
command:
join -o "1.1,1.2,1.3,1.4,2.5,2.6,1.5" file3 <(join -o "1.1,1.2,1.3,1.4,1.5,2.5" file1 file2)
-o
option allows to format the output result with selecting certain fields from both files. 1.x
and 2.x
refers to the file given. For example, 1.1
refers to the first field of the first file.
Since join
only accept 2 files, the bash operator <(...)
is used to create a temporary file.
Another solution using paste
and awk
(still assuming files are sorted):
paste file* | awk '{print $1,$2,$3,$4,$5,$10,$15}'
Upvotes: 1