mike

Reputation: 49

Replace strings by matching to a file

I have two huge files (with more than 1000 rows).

File-1

head File-1
1_10    PL14
1_13    GH13
13_12   GH20
13_137  GH10
13_35   GT19
14_128  GH36
14_131  GH42
14_65   GH109
15_28   GT30
15_30   GH13
16_3    CE1

File-2

head File-2
gene_id HK.1.bam HK.2.bam HK.Hu.bam HKSW.bam UHK.1.bam UHK.2.bam UHK.Hu.1.bam gene_name
1_1 0.069 0.0169 2.826 0 0.004 0.019 0.054 450
1_10 0.030 0.016 2.114 0 0.001 0.000 0.072 2055
1_11 0.012 0.014 1.739 0 0 0 0.0237 171
1_12 0.082 0.071 3.763 0.021 0 0.014 0.102 357
1_13 0.035 0.01 3.836 0 0 0 0.103 234
1_14 0.054 0.031 2.844 0.006 0.005 0.001 0.082 1125

I want to replace each gene_id in File-2 with the matching name from the second column of File-1, without printing the last column of File-2. Ideally, I would like to learn how to produce both Output-1 and Output-2.

Output-1

gene_id HK.1.bam HK.2.bam HK.Hu.bam HKSW.bam UHK.1.bam UHK.2.bam UHK.Hu.1.bam gene_name
1_1 0.069 0.0169 2.826 0 0.004 0.019 0.054 450
PL14 0.030 0.016 2.114 0 0.001 0.000 0.072 2055
1_11 0.012 0.014 1.739 0 0 0 0.0237 171
1_12 0.082 0.071 3.763 0.021 0 0.014 0.102 357
GH13 0.035 0.01 3.836 0 0 0 0.103 234
1_14 0.054 0.031 2.844 0.006 0.005 0.001 0.082 1125

Output-2 (unmapped rows are not printed)

gene_id HK.1.bam HK.2.bam HK.Hu.bam HKSW.bam UHK.1.bam UHK.2.bam UHK.Hu.1.bam gene_name
PL14 0.030 0.016 2.114 0 0.001 0.000 0.072 2055
GH13 0.035 0.01 3.836 0 0 0 0.103 234

I tried:

awk '
NR==FNR {                      
    a[$1]=$2                    
    next                       
}
{                               
    print (($1 in a)?a[$1]:$1, $2, $3, $4, $5,$6, $7, $8) 
}' File-1 File-2 > Output

But Output just contains the unchanged content of File-2.

Corrections to my awk code or any other suggestions (sed, Perl) will be appreciated.

Upvotes: 0

Views: 77

Answers (1)

ufopilot

Reputation: 3975

awk '
   NR==FNR{                        # first file (File-1)
      a[$1]=$2                     # map gene_id -> name
      next                         # skip the File-2 rules below
   }
   {                               # every line of File-2
      NF--                         # drop the last column (gawk; some awks do not rebuild $0 here)
   }
   FNR==1{                         # header line of File-2
      print > "Output1"            # write header to Output1 and Output2
      print > "Output2"
      next
   }
   !($1 in a){                     # gene_id has no mapping
      print > "Output1"            # write unmapped row to Output1 only
   }
   ($1 in a){                      # gene_id has a mapping
      $1=a[$1]                     # replace gene_id with the mapped name
      print > "Output2"            # write mapped row to both files
      print > "Output1"
   }
' File-1 File-2


$ head Output1 Output2
==> Output1 <==
gene_id HK.1.bam HK.2.bam HK.Hu.bam HKSW.bam UHK.1.bam UHK.2.bam UHK.Hu.1.bam
1_1 0.069 0.0169 2.826 0 0.004 0.019 0.054
PL14 0.030 0.016 2.114 0 0.001 0.000 0.072
1_11 0.012 0.014 1.739 0 0 0 0.0237
1_12 0.082 0.071 3.763 0.021 0 0.014 0.102
GH13 0.035 0.01 3.836 0 0 0 0.103
1_14 0.054 0.031 2.844 0.006 0.005 0.001 0.082

==> Output2 <==
gene_id HK.1.bam HK.2.bam HK.Hu.bam HKSW.bam UHK.1.bam UHK.2.bam UHK.Hu.1.bam
PL14 0.030 0.016 2.114 0 0.001 0.000 0.072
GH13 0.035 0.01 3.836 0 0 0 0.103
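
If your awk does not rebuild the record when NF is decremented (gawk does, but some other implementations reportedly do not), a variant along the following lines may be more portable. This is only a sketch: it rebuilds each line from all fields except the last one instead of shrinking NF. The names map, mapped and line are my own, and the sample files are cut-down copies of the excerpts in the question, written to the current directory purely so the snippet runs on its own.

```shell
# Small sample inputs copied from the excerpts in the question (illustration only).
cat > File-1 <<'EOF'
1_10    PL14
1_13    GH13
EOF

cat > File-2 <<'EOF'
gene_id HK.1.bam HK.2.bam HK.Hu.bam HKSW.bam UHK.1.bam UHK.2.bam UHK.Hu.1.bam gene_name
1_1 0.069 0.0169 2.826 0 0.004 0.019 0.054 450
1_10 0.030 0.016 2.114 0 0.001 0.000 0.072 2055
1_11 0.012 0.014 1.739 0 0 0 0.0237 171
1_13 0.035 0.01 3.836 0 0 0 0.103 234
EOF

awk '
    NR == FNR { map[$1] = $2; next }          # File-1: build the gene_id -> name lookup
    {
        mapped = (FNR > 1 && ($1 in map))     # never treat the header as a mapped row
        line = mapped ? map[$1] : $1          # first column, renamed when a mapping exists
        for (i = 2; i < NF; i++)              # append every field except the last one
            line = line OFS $i
        print line > "Output1"                # Output1 gets every row
    }
    FNR == 1 || mapped { print line > "Output2" }   # Output2 gets the header and mapped rows
' File-1 File-2
```

Like the original, this assumes File-1 is not empty (otherwise NR==FNR would also be true for File-2) and that both files are whitespace-separated.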

Upvotes: 1
