BioMan
BioMan

Reputation: 704

Create new column that merges two columns

I have 100 files with similar format (shown below). I want to add a new column to every files that combines the column mir and seq separated by an _ This has to be done automatically for all files. All my files have these names:

Miraligner_240G.txt.mirna
Miraligner_239R.txt.mirna
....

They differ between _ and .txt.mirna

My input files look like this:

seq                     name          freq  mir           start end mism add    t5  t3       s5         s3    DB  ambiguity
TGAGAAGAAGCACTGTAGCTCTT seq_100006_x2   2   hsa-miR-143-3p  61  81  6AT u-TT    0   0   AGTCTGAG    GCTCAGGA    miRNA   1
GACCCTGTAGATCCGAATTTGTA seq_100012_x2   2   hsa-miR-10a-5p  22  43  1GT u-A 0   u-G TATATACC    TGTGTAAG    miRNA   1
GACCCTGTAGATCCGAATTTGTG seq_100013_x35  35  hsa-miR-10a-5p  22  44  1GT 0   0   0   TATATACC    TGTGTAAG    miRNA   1
TTAGGGCCCTGGCTCCATCT    seq_100019_x13  13  hsa-miR-1296-5p 16  35  0   0   0   u-CC    TGGGTTAG    CTCCTTTA    miRNA   1
GTGAACGGGCGCCATCCCGAGGCTT   seq_100029_x2   2   hsa-miR-887-3p  48  72  0   0   0   d-CTT   TGGAGTGA    GAGGCTTT    miRNA   1
ACCCGGTAGATCCGAATTTGTG  seq_10002_x5    5   hsa-miR-10a-5p  23  44  5GT 0   d-T 0   TATATACC    TGTGTAAG    miRNA   1
CAACGGAATCCCAAAAGCAGCTGAAAA seq_100031_x3   3   hsa-miR-191-5p  16  39  24AT    u-AAA   0   d-T CGGGCAAC    GCTGTTGT    miRNA   1
TAGTGCAATATTGCTTATAGGGTAT   seq_100033_x2   2   hsa-miR-454-3p  64  86  0   u-AT    0   0   TGAGTAGT    GGGTTTTG    miRNA   1
CAACGGAATCCGAAAAGCAGCTG seq_100037_x16  16  hsa-miR-191-5p  16  38  12GC    0   0   0   CGGGCAAC    GCTGTTGT    miRNA   1

My output files should look like this:

mir_seq                                  seq                        name          freq  mir           start end mism add t5 t3       s5         s3    DB  ambiguity
hsa-miR-143-3p_TGAGAAGAAGCACTGTAGCTCTT   TGAGAAGAAGCACTGTAGCTCTT    seq_100006_x2   2   hsa-miR-143-3p  61  81  6AT u-TT    0   0   AGTCTGAG    GCTCAGGA    miRNA   1
....
....

Upvotes: 0

Views: 1791

Answers (1)

Rakholiya Jenish
Rakholiya Jenish

Reputation: 3223

Using awk you can add a column containing values of column 1 and 4 as:

awk '{print $4"_"$1, $0}' filename

OFS will print the value of output field separator variable.

On piping the output to column -t:

mir_seq                                     seq                          name            freq  mir              start  end  mism  add    t5   t3     s5        s3        DB     ambiguity
hsa-miR-143-3p_TGAGAAGAAGCACTGTAGCTCTT      TGAGAAGAAGCACTGTAGCTCTT      seq_100006_x2   2     hsa-miR-143-3p   61     81   6AT   u-TT   0    0      AGTCTGAG  GCTCAGGA  miRNA  1
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTA      GACCCTGTAGATCCGAATTTGTA      seq_100012_x2   2     hsa-miR-10a-5p   22     43   1GT   u-A    0    u-G    TATATACC  TGTGTAAG  miRNA  1
hsa-miR-10a-5p_GACCCTGTAGATCCGAATTTGTG      GACCCTGTAGATCCGAATTTGTG      seq_100013_x35  35    hsa-miR-10a-5p   22     44   1GT   0      0    0      TATATACC  TGTGTAAG  miRNA  1
hsa-miR-1296-5p_TTAGGGCCCTGGCTCCATCT        TTAGGGCCCTGGCTCCATCT         seq_100019_x13  13    hsa-miR-1296-5p  16     35   0     0      0    u-CC   TGGGTTAG  CTCCTTTA  miRNA  1
hsa-miR-887-3p_GTGAACGGGCGCCATCCCGAGGCTT    GTGAACGGGCGCCATCCCGAGGCTT    seq_100029_x2   2     hsa-miR-887-3p   48     72   0     0      0    d-CTT  TGGAGTGA  GAGGCTTT  miRNA  1
hsa-miR-10a-5p_ACCCGGTAGATCCGAATTTGTG       ACCCGGTAGATCCGAATTTGTG       seq_10002_x5    5     hsa-miR-10a-5p   23     44   5GT   0      d-T  0      TATATACC  TGTGTAAG  miRNA  1
hsa-miR-191-5p_CAACGGAATCCCAAAAGCAGCTGAAAA  CAACGGAATCCCAAAAGCAGCTGAAAA  seq_100031_x3   3     hsa-miR-191-5p   16     39   24AT  u-AAA  0    d-T    CGGGCAAC  GCTGTTGT  miRNA  1
hsa-miR-454-3p_TAGTGCAATATTGCTTATAGGGTAT    TAGTGCAATATTGCTTATAGGGTAT    seq_100033_x2   2     hsa-miR-454-3p   64     86   0     u-AT   0    0      TGAGTAGT  GGGTTTTG  miRNA  1
hsa-miR-191-5p_CAACGGAATCCGAAAAGCAGCTG      CAACGGAATCCGAAAAGCAGCTG      seq_100037_x16  16    hsa-miR-191-5p   16     38   12GC  0      0    0      CGGGCAAC  GCTGTTGT  miRNA  1

Since awk don't have inline editing option, you will have to use gawk for inline editing. Using awk, you can write the output to temporary file and then move/copy/rename it to original file name.

For using the command in multiple file:

for i in Miraligner_*.txt.mirna; do
    awk '{print $4"_"$1, $0}' "$i" | column -t;
done

If you are using gawk and interested in doing inline editing, use gawk -i inplace


Using perl:

perl -ane 'print "$F[3]_$F[0] $_";' filename | column -t

If you want to write to file, use -i option:

perl -ane 'print "$F[3]_$F[0] $_";' -i filename

Separate all the input fields and the appended column (field) with \t:

perl -ane '$"="\t"; print "$F[3]_$F[0] @F\n";' -i filename

If you want your output in your file in proper tabular form:

for i in Miraligner_*.txt.mirna; do
    awk '{print $4"_"$1, $0}' "$i" | column -t > temp && mv temp "$i";
done

This will give output separated into proper column in your file. For this, you won't need inline editing option.


Thanks to @EdMorton for correcting my mistakes.

Upvotes: 4

Related Questions