BioFreak

Reputation: 23

Compare two files on the basis of two columns, but retain duplicate lines that differ in a pattern (gene/CDS)

file1:

scaffold2232_size19577   gene       8878    9258
scaffold2232_size19577   CDS        8878    9258
scaffold2232_size19577   gene       10631   14562
scaffold2232_size19577   intron     10693   11242
scaffold2232_size19577   intron     11343   14252
scaffold2232_size19577   intron     14346   14499
scaffold2232_size19577   CDS        10631   10692
scaffold2232_size19577   CDS        11243   11342
scaffold2232_size19577   CDS        14253   14345
scaffold2232_size19577   CDS        14500   14562
scaffold2232_size19577   gene       18807   19055
scaffold2232_size19577   CDS        18807   19055

file2:

scaffold2232_size19577   8878   9258    Os12t0508300-01
scaffold2232_size19577   8878   9258    Os12t0508300-01
scaffold2232_size19577   10631  14562   Os12t0508300-01
scaffold2232_size19577   10693  11242   Os12t0508300-01
scaffold2232_size19577   11343  14252   Os12t0508300-01
scaffold2232_size19577   14346  14499   Os12t0508400-00
scaffold2232_size19577   14346  14499   Os12t0508400-00
scaffold2232_size19577   14346  14499   Os12t0508400-00
scaffold2232_size19577   10631  10692   Os12t0508300-01
scaffold2232_size19577   11243  11342   Os12t0508300-01
scaffold2232_size19577   14253  14345   Os12t0508400-00
scaffold2232_size19577   14253  14345   Os12t0508400-00
scaffold2232_size19577   14253  14345   Os12t0508400-00
scaffold2232_size19577   14500  14562   Os12t0508400-00
scaffold2232_size19577   14500  14562   Os12t0508400-00
scaffold2232_size19577   14500  14562   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00

desired output:

scaffold2232_size19577   8878   9258    Os12t0508300-01 gene
scaffold2232_size19577   8878   9258    Os12t0508300-01 CDS
scaffold2232_size19577   10631  14562   Os12t0508300-01 gene
scaffold2232_size19577   10693  11242   Os12t0508300-01 intron
scaffold2232_size19577   11343  14252   Os12t0508300-01 intron
scaffold2232_size19577   14346  14499   Os12t0508400-00 intron
scaffold2232_size19577   10631  10692   Os12t0508300-01 CDS
scaffold2232_size19577   11243  11342   Os12t0508300-01 CDS
scaffold2232_size19577   14253  14345   Os12t0508400-00 CDS
scaffold2232_size19577   14500  14562   Os12t0508400-00 CDS
scaffold2232_size19577   18807  19055   Os12t0508400-00 gene
scaffold2232_size19577   18807  19055   Os12t0508400-00 CDS

I tried:

awk '{a[$1,$2,$3]=$0}END{for(i in a) print a[i]}' file2

But with this I lose one of the gene/CDS lines, because they have the same coordinates in columns 2 and 3, so the output is:

scaffold2232_size19577   8878   9258    Os12t0508300-01
scaffold2232_size19577   10631  14562   Os12t0508300-01
scaffold2232_size19577   10693  11242   Os12t0508300-01
scaffold2232_size19577   11343  14252   Os12t0508300-01
scaffold2232_size19577   14346  14499   Os12t0508400-00
scaffold2232_size19577   10631  10692   Os12t0508300-01
scaffold2232_size19577   11243  11342   Os12t0508300-01
scaffold2232_size19577   14253  14345   Os12t0508400-00
scaffold2232_size19577   14500  14562   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00

I thought I could then add column 2 of file1 to file2, but after this awk operation the result has fewer rows than file1, so I cannot append the column line by line. I want the result to look like my desired output.

Upvotes: 0

Views: 40

Answers (1)

Jotne

Reputation: 41446

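Something like this? It reads file2 first (here named f2) and stores the ID keyed by the two coordinate columns, then goes through file1 (f1) and prints every one of its lines with the matching ID added: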

awk 'FNR==NR {a[$2FS$3]=$4;next} {print $1,$3,$4,a[$3FS$4],$2}' OFS="\t" f2 f1
scaffold2232_size19577  8878    9258    Os12t0508300-01 gene
scaffold2232_size19577  8878    9258    Os12t0508300-01 CDS
scaffold2232_size19577  10631   14562   Os12t0508300-01 gene
scaffold2232_size19577  10693   11242   Os12t0508300-01 intron
scaffold2232_size19577  11343   14252   Os12t0508300-01 intron
scaffold2232_size19577  14346   14499   Os12t0508400-00 intron
scaffold2232_size19577  10631   10692   Os12t0508300-01 CDS
scaffold2232_size19577  11243   11342   Os12t0508300-01 CDS
scaffold2232_size19577  14253   14345   Os12t0508400-00 CDS
scaffold2232_size19577  14500   14562   Os12t0508400-00 CDS
scaffold2232_size19577  18807   19055   Os12t0508400-00 gene
scaffold2232_size19577  18807   19055   Os12t0508400-00 CDS
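
The one-liner above keys the lookup on the two coordinates only. If the real files contain more than one scaffold, two scaffolds could share the same start and end, so a slightly more defensive variant might also put column 1 in the key. The following is only a sketch under that assumption: the temporary directory, the array name `id`, and the three-line sample files (a subset of the data in the question) are made up for illustration.

```shell
# Hypothetical sample files for illustration (a subset of the question's data).
tmp=$(mktemp -d)
printf '%s\n' \
  'scaffold2232_size19577 gene 8878 9258' \
  'scaffold2232_size19577 CDS 8878 9258' \
  'scaffold2232_size19577 intron 14346 14499' > "$tmp/file1"
printf '%s\n' \
  'scaffold2232_size19577 8878 9258 Os12t0508300-01' \
  'scaffold2232_size19577 8878 9258 Os12t0508300-01' \
  'scaffold2232_size19577 14346 14499 Os12t0508400-00' \
  'scaffold2232_size19577 14346 14499 Os12t0508400-00' > "$tmp/file2"

# First file (file2): remember the ID for each scaffold/start/end triple.
# Second file (file1): print one line per file1 row, so gene and CDS rows
# with identical coordinates are both kept.
out=$(awk -v OFS='\t' '
  FNR == NR { id[$1, $2, $3] = $4; next }
  { print $1, $3, $4, id[$1, $3, $4], $2 }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Because the output is driven by file1, the number of printed lines always equals the number of lines in file1, and the repeated lines in file2 simply overwrite the same array entry. A file1 row with no match in file2 is printed with an empty ID column.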

Upvotes: 1
