Apex

Reputation: 1096

How to compare two files in bash while allowing for a mismatch

I have two files as below:

f1:

>seq11
TCAGATGTGTATAAGAGACAGGATTCTTCCACGGTTATTGAGAGTANGCGAGAA
>seq95
TCAGATGTGTATAAGAGACAGTACGTCTTGGTGACTATATCGAGGCNGAATGAA
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGACTGCGCTAAGCGGCTACTTCGCATACT
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCAAGACCACTTGTGGCCGTTCGCATACT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGTCCGAGCTTGCCGAACAGTATTCAGCCT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCGGAGAACCGTGCTCTCATATTCAGCCT

f2:

>seq11
TCAGATGTGTATAAGAGACAGGATTCTTCCACGGTTATTGAGAGTATGCGAGAA
>seq95
TCAGATGTGTATAAGAGACAGTACGTCTTGGTGACTATATCGAGGCTGAATGAA
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGACTTCAACGTGTCAGGCCGTCGCATACT
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGACTGCGCTAAGCGGCTACTTCGCATACT
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCAAGACCACTTGTGGCCGTTCGCATACT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCCTCTGCGCTAACGAGAGTATTCAGCCT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGTCCGAGCTTGCCGAACAGTATTCAGCCT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCGGAGAACCGTGCTCTCATATTCAGCCT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGGTAAGACCGTCTGCACCGTATTCAGCCT

f1 is a subset of f2. I want to print all the lines of f1, but each of the shorter lines in f1 contains an N character that should be replaced with the original character from the corresponding line in f2. The desired output is:

>seq11
TCAGATGTGTATAAGAGACAGGATTCTTCCACGGTTATTGAGAGTATGCGAGAA
>seq95
TCAGATGTGTATAAGAGACAGTACGTCTTGGTGACTATATCGAGGCTGAATGAA
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGACTGCGCTAAGCGGCTACTTCGCATACT
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCAAGACCACTTGTGGCCGTTCGCATACT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGTCCGAGCTTGCCGAACAGTATTCAGCCT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCGGAGAACCGTGCTCTCATATTCAGCCT

I know about grep -f f2.fa f1.fa, but I have not been able to make it ignore the N mismatch.

How can I do this?

Thank you in advance.

Upvotes: 1

Views: 132

Answers (1)

Ed Morton

Reputation: 203209

Try this using GNU awk for arrays of arrays:

$ cat tst.awk
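BEGIN {
    # Build a set of the four bases that an N may stand for.
    split("T C A G",tmp)
    for ( i in tmp ) {
        chars[tmp[i]]
    }
    # Length of a full sequence, taken from the sample data.
    fullLength = length("GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGACTGCGCTAAGCGGCTACTTCGCATACT")
}

# Header line: remember it as the key for the sequence that follows.
/^>/ {
    key = $1
    next
}

{ currLength = length($0) }

# First file (f2): store every short sequence under its header.
NR == FNR {
    if ( currLength < fullLength ) {
        shortStrings[key][$0]
    }
    next
}

# Second file (f1): full-length sequences are printed unchanged.
currLength == fullLength {
    print key ORS $0
    next
}

# Second file (f1), short sequence: try each base in place of the first N
# and print every candidate that f2 has under the same header.
key in shortStrings {
    delete currStrings
    currStrings[$0]
    if ( (pos = index($0,"N")) ) {
        for ( char in chars ) {
            currStrings[substr($0,1,pos-1) char substr($0,pos+1)]
        }
    }
    for ( string in currStrings ) {
        if ( string in shortStrings[key] ) {
            print key ORS string
        }
    }
}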

$ awk -f tst.awk f2 f1
>seq11
TCAGATGTGTATAAGAGACAGGATTCTTCCACGGTTATTGAGAGTATGCGAGAA
>seq95
TCAGATGTGTATAAGAGACAGTACGTCTTGGTGACTATATCGAGGCTGAATGAA
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGACTGCGCTAAGCGGCTACTTCGCATACT
>seq11
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCAAGACCACTTGTGGCCGTTCGCATACT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGTCCGAGCTTGCCGAACAGTATTCAGCCT
>seq95
GAGATTATGTGGGAAAGTTCATGGAATCGAGCGGAGATGTGTATAAGAGACAGCGGAGAACCGTGCTCTCATATTCAGCCT
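
If GNU awk is not available, the same idea can be sketched in POSIX awk. This is only a rough variant, not the script above: the function names fill_n and n_match are made up for illustration, it uses ref[key, i] compound subscripts instead of arrays of arrays, and it treats every N in a line as a wildcard rather than only the first one. It assumes each sequence sits on a single line directly under its header, as in the sample files.

```shell
#!/bin/sh
# fill_n REFERENCE QUERY   (hypothetical helper name)
# Prints QUERY; every sequence containing N is replaced by the REFERENCE
# sequence(s) under the same header that match it when N is a wildcard.
fill_n() {
    awk '
    # Return 1 if q equals r once every N in q is treated as a wildcard.
    function n_match(q, r,    j, c) {
        if (length(q) != length(r)) return 0
        for (j = 1; j <= length(q); j++) {
            c = substr(q, j, 1)
            if (c != "N" && c != substr(r, j, 1)) return 0
        }
        return 1
    }

    # Header line: remember it as the key for the next sequence.
    /^>/ { key = $1; next }

    # First file (reference): store each sequence under its header.
    NR == FNR { ref[key, ++count[key]] = $0; next }

    # Second file, no N in the sequence: print it unchanged.
    index($0, "N") == 0 { print key ORS $0; next }

    # Second file, sequence with N: print each matching reference sequence.
    {
        for (i = 1; i <= count[key]; i++) {
            if (n_match($0, ref[key, i])) print key ORS ref[key, i]
        }
    }
    ' "$1" "$2"
}
```

Called as fill_n f2 f1, this should behave like the script above on the sample data. Unlike that script, it does not depend on a hardcoded full length, and a sequence that occurs more than once in f2 under the same header is printed once per occurrence.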

Upvotes: 3
