Reputation: 351
I have 1818 .CEL files that look like this:
$ head RIBBY_g_5ZCV995_BI_SNP_D04_38774.CEL
RIBBY_g_5ZCV995_BI_SNP_D04_38774.CEL
FQC-10090295 0.007813
FQC-10119363 0.023438
FQC-10132112 0.093750
...
$ head UNTIL_g_3ECO791_BI_SNP_H10_36454.CEL
UNTIL_g_3ECO791_BI_SNP_H10_36454.CEL
FQC-10090295 0.187500
FQC-10119363 0.023438
FQC-10132112 0.039063
...
so the header of the file and the file name are the same. And each of 1818 .CEL files has 3022 lines.
I am trying to merge all .CEL files by the first column (entries are the same) and later transpose that so that .CEL names are the first column and the rest of the column names are row entries of each 1818 and there is 3022 of them. So I would have on the end one file with 3023 columns and 1818 rows (and rows are .CEL file names)
so it would looks like this:
FQC-10090295 FQC-10119363 FQC-10132112 ...
RIBBY_g_5ZCV995_BI_SNP_D04_38774.CEL 0.007813 0.023438 0.093750 ...
UNTIL_g_3ECO791_BI_SNP_H10_36454.CEL 0.187500 0.023438 0.039063 ...
...
I am trying to do this with this transpose.sh script:
header=
for f in *.CEL
do
if [ -z "$header" ]
then
l=${#f} # length of filename ...
fmt="%-${l}s" # ... determines width of first column
header="$(printf "$fmt" "")" # first column of first row is blank
for col in $(awk '{print $1}' $f)
do
l=${#col} # width of column
[ $l -lt 8 ] && l=8 # min width of 8
header="$header $(printf "%-${l}s" "$col")" # append column label to header
fmt="$fmt %-$l.6f"
done
printf '%s\n' "$header" # header is first row of output
fi
printf "$fmt\n" "$f" $(awk '{print $2}' $f) # print filename and all column 2 values
done
running this command:
sh transpose.sh > output.txt
The issue is that I am getting 1817 rows and 3024 columns and output.txt looks like this:
a=fread("output.txt") Warning message: In fread("output.txt") : Detected 3023 column names but the data has 3024 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.
> head(a)
V1 ABAFT_g_4RWG569_BI_SNP_A01_34952.CEL
1: ABAFT_g_4RWG569_BI_SNP_A01_34952.CEL 0.062500
2: ABAFT_g_4RWG569_BI_SNP_A02_34968.CEL 0.023438
3: ABAFT_g_4RWG569_BI_SNP_A03_34984.CEL 0.007813
4: ABAFT_g_4RWG569_BI_SNP_A04_35000.CEL 0.023438
5: ABAFT_g_4RWG569_BI_SNP_A05_35016.CEL 0.007813
6: ABAFT_g_4RWG569_BI_SNP_A06_35032.CEL 0.132813
FQC-10090295 FQC-10119363 FQC-10132112 FQC-10201128 FQC-10208432
1: 0.023438 0.023438 0.023438 0.023438 0.007813
2: 0.023438 0.007813 0.062500 0.007813 0.007813
3: 0.007813 0.007813 0.007813 0.023438 0.007813
4: 0.007813 0.007813 0.007813 0.007813 0.007813
5: 0.023438 0.023438 0.007813 0.023438 0.007813
6: 0.007813 0.007813 0.007813 0.007813 0.023438
....
so there is this one additional column being created: ABAFT_g_4RWG569_BI_SNP_A01_34952.CEL
and also one row is missing since there is 1817 rows and it should be 1818 rows whcih correspond to .CEL names.
UPDATE: per @markp-fuso suggestion I did replace:
for col in $(awk '{print $1}' $f)
with
for col in $(awk 'NR>=2 {print $1}' $f)
and this gave me the correct number of columns. But the number of rows is still 1817 and it should be 1818.
Upvotes: 2
Views: 108
Reputation: 35156
Early in the question we are shown the first line of each file contains the name of the file; this means the first pass through this loop ...
for col in $(awk '{print $1}' $f)
... is going to print the first column of the first line which is the name of the first file (ie, ABAFT_g_4RWG569_BI_SNP_A01_34952.CEL
).
You could try modifying this particular awk
command to skip the first line like such:
for col in $(awk 'NR>=2 {print $1}' $f)
I haven't tried to understand the rest of the logic so there may be other issues.
I'd suggest updating the sample code to a) include the missing done
and b) do some work on lining up the indentations to make it a bit easier to read.
Upvotes: 3