anamaria
anamaria

Reputation: 351

How to merge and transpose multiple files?

I have 1818 .CEL files that look like this:

$ head RIBBY_g_5ZCV995_BI_SNP_D04_38774.CEL
RIBBY_g_5ZCV995_BI_SNP_D04_38774.CEL
FQC-10090295         0.007813
FQC-10119363         0.023438
FQC-10132112         0.093750
...

$ head UNTIL_g_3ECO791_BI_SNP_H10_36454.CEL
UNTIL_g_3ECO791_BI_SNP_H10_36454.CEL
FQC-10090295 0.187500
FQC-10119363 0.023438
FQC-10132112 0.039063
...

so the header of the file and the file name are the same. And each of 1818 .CEL files has 3022 lines.

I am trying to merge all .CEL files by the first column (entries are the same) and later transpose that so that .CEL names are the first column and the rest of the column names are row entries of each 1818 and there is 3022 of them. So I would have on the end one file with 3023 columns and 1818 rows (and rows are .CEL file names)

so it would looks like this:

                                     FQC-10090295 FQC-10119363 FQC-10132112  ...
RIBBY_g_5ZCV995_BI_SNP_D04_38774.CEL 0.007813     0.023438     0.093750  ...
UNTIL_g_3ECO791_BI_SNP_H10_36454.CEL 0.187500     0.023438     0.039063  ...
...

I am trying to do this with this transpose.sh script:

header=

for f in *.CEL
do
    if [ -z "$header" ]
    then
        l=${#f}                 # length of filename ...
        fmt="%-${l}s"           # ... determines width of first column

        header="$(printf "$fmt" "")"    # first column of first row is     blank

        for col in $(awk '{print $1}' $f)
        do
            l=${#col}               # width of column
            [ $l -lt 8 ] && l=8     # min width of 8
            header="$header $(printf "%-${l}s" "$col")"     # append column label to header
            fmt="$fmt %-$l.6f"
        done
        printf '%s\n' "$header"         # header is first row of output
    fi

    printf "$fmt\n" "$f" $(awk '{print $2}' $f)     # print filename and all column 2 values
done

running this command:

sh transpose.sh > output.txt

The issue is that I am getting 1817 rows and 3024 columns and output.txt looks like this:

a=fread("output.txt") Warning message: In fread("output.txt") : Detected 3023 column names but the data has 3024 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.

> head(a)
                                     V1 ABAFT_g_4RWG569_BI_SNP_A01_34952.CEL
1: ABAFT_g_4RWG569_BI_SNP_A01_34952.CEL                             0.062500
2: ABAFT_g_4RWG569_BI_SNP_A02_34968.CEL                             0.023438
3: ABAFT_g_4RWG569_BI_SNP_A03_34984.CEL                             0.007813
4: ABAFT_g_4RWG569_BI_SNP_A04_35000.CEL                             0.023438
5: ABAFT_g_4RWG569_BI_SNP_A05_35016.CEL                             0.007813
6: ABAFT_g_4RWG569_BI_SNP_A06_35032.CEL                             0.132813
   FQC-10090295 FQC-10119363 FQC-10132112 FQC-10201128 FQC-10208432
1:     0.023438     0.023438     0.023438     0.023438     0.007813
2:     0.023438     0.007813     0.062500     0.007813     0.007813
3:     0.007813     0.007813     0.007813     0.023438     0.007813
4:     0.007813     0.007813     0.007813     0.007813     0.007813
5:     0.023438     0.023438     0.007813     0.023438     0.007813
6:     0.007813     0.007813     0.007813     0.007813     0.023438
....

so there is this one additional column being created: ABAFT_g_4RWG569_BI_SNP_A01_34952.CEL

and also one row is missing since there is 1817 rows and it should be 1818 rows whcih correspond to .CEL names.

UPDATE: per @markp-fuso suggestion I did replace:

for col in $(awk '{print $1}' $f)

with

for col in $(awk 'NR>=2 {print $1}' $f)

and this gave me the correct number of columns. But the number of rows is still 1817 and it should be 1818.

Upvotes: 2

Views: 108

Answers (1)

markp-fuso
markp-fuso

Reputation: 35156

Early in the question we are shown the first line of each file contains the name of the file; this means the first pass through this loop ...

for col in $(awk '{print $1}' $f)

... is going to print the first column of the first line which is the name of the first file (ie, ABAFT_g_4RWG569_BI_SNP_A01_34952.CEL).

You could try modifying this particular awk command to skip the first line like such:

for col in $(awk 'NR>=2 {print $1}' $f)

I haven't tried to understand the rest of the logic so there may be other issues.

I'd suggest updating the sample code to a) include the missing done and b) do some work on lining up the indentations to make it a bit easier to read.

Upvotes: 3

Related Questions