grindbert

Reputation: 183

Create dictionary from Fasta file

I recently picked up a program I started some time ago (sorting and listing of gene code). Since I am a beginner and couldn't find anything about this specific problem online, I need some help.

I want to create a dictionary from a FASTA (.fna) file that looks like this (actual files linked, because the editor distorts everything I copy and paste; I hope that is allowed): http://www.filedropper.com/firstreads http://pastebin.com/NNXV09A7 <- smallReads

I know how to make a dictionary manually, but I have no idea how to combine that with reading the dictionary entries from a file. I appreciate any help.

edit: manual dict from example above:

dict= {'ATTC': 'T', 'CATT': 'C'}

or using the constructor:

dict([('ATTC', 'T'), ('CATT', 'C')])

edit2: Thanks to Byte Commander I was able to make the function by adding the parameters:

def makeSuffixDict (inputfile="smallReads.fna", n=15):

my_dict = {}

with open(inputfile) as file:

    for line in file:

        word = line.split()[-1]

        my_dict[word[:-n]] = word[-n:]

return()


if __name__ == "__main__":
makeSuffixDict
for keys,values in my_dict.items():
    print(keys)
    print(values)
    print('\n')

However, when I try to change the suffix length, I get the same result. How can I make the suffix length variable?

There is also one small issue: the "'': '5'" entry at the beginning. Why is it there, and can I keep it from showing up?

{'': '5', 'ATTG': 'T', 'GCCC': 'T', 'TTTT': 'T', 'AGTC': 'C'}

Similarly, when I try another file with 30000 reads instead of 5, numbers pop up every now and then, and I have no clue where they come from. Example:

CAAGATCTAATATGAATTACAGAGAGCTGTTCAGCAAATACTTGTTGCATCAATGGAATTACAGCAGTAACACATATATTGACCTGGAACCAGAATCATGTTCTGAATGCAGAAGTACGTACTTTCTTTTTCTTTCTTGAGAACGCTGGATCTTTTTTAAAATGTTAATTTGCAGTTTGAAGCTGTTTAGGTTAAAAAAAAAATACAAGAAGCAGCAGCAAAAGAGACC : A

2407 : 9

ATTCTTTCATACCATTAAATATTTATTTTTCAAAACTGATCTTAGTAGAGGCCTAGTACTGTCTCATATAAATATAGGATAATATATATAATAAATCCCCTGACATCAGACATTAAGGTTACTCCCAATTACTTATTATCTTTATATATATGTTAAAAATATGTGTGTATAATATGTAAGTAAACAATTTGCATAGTTTATATGTGGTAATATATGGTTAATATATAGG : C

Upvotes: 0

Views: 1114

Answers (1)

Byte Commander

Reputation: 6746

# create an empty dictionary:
my_dict = {}

# open the file:
with open("file.fna") as file:

    # read the file line by line:
    for line in file:

        # split at whitespace and take the last word:
        word = line.split()[-1]

        # add entry to dictionary: all but the last character -> key; last character -> value
        my_dict[word[:-1]] = word[-1]

See this code running on ideone.com (reading from STDIN instead of file though...)

Update:

If you want variable length suffixes, replace the last line above with this, where n is the length of the suffix which becomes the dictionary value:

        my_dict[word[:-n]] = word[-n:]

See this code running on ideone.com (reading from STDIN instead of file though...)
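To make the slicing concrete, here is a minimal sketch of how `word[:-n]` and `word[-n:]` behave for different values of `n`. The helper name `split_suffix` and the sample strings are made up for illustration and are not taken from your file. The last call also shows what happens when a word is no longer than `n`: the key becomes an empty string, which would explain an entry like `'': '5'` if some line of the file ends in a bare number.

```python
def split_suffix(word, n):
    """Return (prefix, suffix), where suffix is the last n characters.

    Assumes n >= 1; with n == 0 the slices would give ('', word) instead.
    """
    return word[:-n], word[-n:]

# Hypothetical read, invented for this example.
print(split_suffix("ATTCGGA", 1))  # ('ATTCGG', 'A')
print(split_suffix("ATTCGGA", 3))  # ('ATTC', 'GGA')

# A word that is not longer than n yields an empty prefix:
print(split_suffix("5", 1))        # ('', '5')
```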


Update 2:

Your code as stated in the question has some problems with the indentation. Also, there are no parentheses after makeSuffixDict in the main block, but you need them to call a function. The return() in your function returns an empty tuple; you need to return the dictionary created in the function to work with it outside.

I now also parse the third whitespace-separated word in each line instead of the last one. Lines that do not have exactly three words are ignored.

Here's my version:

def makeSuffixDict (inputfile="smallReads.fna", n=15):
    my_dict = {}
    with open(inputfile) as file:
        for line in file:
            words = line.split()
            if len(words) == 3:  # <-- replaced "last word" with "3rd word"
                word = words[2]
                my_dict[word[:-n]] = word[-n:]
    return my_dict

if __name__ == "__main__":
    my_dict = makeSuffixDict()
    for key,value in my_dict.items():
        print(key, value)
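To try different suffix lengths, pass `n` explicitly when calling the function instead of relying on the default. The sketch below is a variant that accepts any iterable of lines, so it runs without a file. The name `make_suffix_dict`, the sample text, and the extra length check are my additions for illustration, and the three-field layout with the read in the third field is an assumption about your file.

```python
import io

def make_suffix_dict(lines, n=1):
    """Map each read's prefix to its last n characters.

    `lines` is any iterable of strings (an open file works too).
    Only lines with exactly three whitespace-separated fields are used;
    the read is assumed to be the third field.
    """
    result = {}
    for line in lines:
        words = line.split()
        if len(words) != 3:
            continue  # skip counts, headers and blank lines
        read = words[2]
        if len(read) <= n:
            continue  # too short to leave a non-empty prefix
        result[read[:-n]] = read[-n:]
    return result

# Hypothetical input, invented for this example.
sample = "5\n1 x ATTCT\n2 x CATTC\n\n3 x GG\n"

print(make_suffix_dict(io.StringIO(sample), n=1))
# {'ATTC': 'T', 'CATT': 'C', 'G': 'G'}
print(make_suffix_dict(io.StringIO(sample), n=2))
# {'ATT': 'CT', 'CAT': 'TC'}
```

With a real file this would be called as `make_suffix_dict(open_file, n=15)` inside a `with open(...)` block. Note that two reads sharing the same prefix end up under one key, so the later one overwrites the earlier one.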

Upvotes: 1
