Reputation: 183
I recently picked up a program I started some time ago (sorting and listing of gene code), and since I am a beginner and couldn't find anything about this specific problem online, I need some help.
I want to create a dictionary from a FASTA (.fna) file that looks like this (actual files uploaded, because the editor distorts everything I copy and paste; I hope that is allowed): http://www.filedropper.com/firstreads http://pastebin.com/NNXV09A7 <- smallReads
I know how to make a dictionary manually, but I have no idea how to combine that with reading the dictionary entries from a file. I appreciate any help.
edit: manual dict from example above:
dict= {'ATTC': 'T', 'CATT': 'C'}
or using the constructor:
dict([('ATTC', 'T'), ('CATT', 'C')])
edit2: Thanks to Byte Commander I was able to make the function by adding the parameters:
def makeSuffixDict (inputfile="smallReads.fna", n=15):
my_dict = {}
with open(inputfile) as file:
for line in file:
word = line.split()[-1]
my_dict[word[:-n]] = word[-n:]
return()
if __name__ == "__main__":
makeSuffixDict
for keys,values in my_dict.items():
print(keys)
print(values)
print('\n')
However, when I try to change the suffix length, I get the same result. How can I make the suffix length variable?
There is also one small issue, and that's the "'': '5'" at the beginning. Why is it there, and can I make it not show up?
{'': '5', 'ATTG': 'T', 'GCCC': 'T', 'TTTT': 'T', 'AGTC': 'C'}
Similarly, when I try another file with 30000 reads instead of 5, every now and then numbers pop up and I have no clue where they come from. Example:
CAAGATCTAATATGAATTACAGAGAGCTGTTCAGCAAATACTTGTTGCATCAATGGAATTACAGCAGTAACACATATATTGACCTGGAACCAGAATCATGTTCTGAATGCAGAAGTACGTACTTTCTTTTTCTTTCTTGAGAACGCTGGATCTTTTTTAAAATGTTAATTTGCAGTTTGAAGCTGTTTAGGTTAAAAAAAAAATACAAGAAGCAGCAGCAAAAGAGACC : A
2407 : 9
ATTCTTTCATACCATTAAATATTTATTTTTCAAAACTGATCTTAGTAGAGGCCTAGTACTGTCTCATATAAATATAGGATAATATATATAATAAATCCCCTGACATCAGACATTAAGGTTACTCCCAATTACTTATTATCTTTATATATATGTTAAAAATATGTGTGTATAATATGTAAGTAAACAATTTGCATAGTTTATATGTGGTAATATATGGTTAATATATAGG : C
Upvotes: 0
Views: 1114
Reputation: 6746
# create an empty dictionary:
my_dict = {}
# open the file:
with open("file.fna") as file:
    # read the file line by line:
    for line in file:
        # split at whitespace and take the last word:
        word = line.split()[-1]
        # add entry to dictionary: all but the last character -> key; last character -> value
        my_dict[word[:-1]] = word[-1]
See this code running on ideone.com (reading from STDIN instead of file though...)
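One caveat with this approach: dictionary keys are unique, so if two reads share the same prefix, the later one silently overwrites the earlier one. Also, `line.split()[-1]` raises an `IndexError` on a blank line. If either matters for your data, a minimal sketch like the following collects all suffixes per prefix in a list instead. The function name `make_suffix_lists` and the sample reads are made up for illustration and are not taken from your file.

```python
from collections import defaultdict

def make_suffix_lists(lines, n=1):
    """Map each prefix to a list of every suffix seen for it."""
    suffixes = defaultdict(list)
    for line in lines:
        words = line.split()
        if not words:
            # skip blank lines instead of failing on words[-1]
            continue
        word = words[-1]
        suffixes[word[:-n]].append(word[-n:])
    return dict(suffixes)

# Hypothetical reads: the first two share the prefix 'ATTC'.
sample_lines = ["ATTCT", "ATTCG", "", "CATTC"]
print(make_suffix_lists(sample_lines))  # {'ATTC': ['T', 'G'], 'CATT': ['C']}
```

With a plain dictionary, the same input would keep only 'G' for the key 'ATTC'.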
Update:
If you want variable-length suffixes, replace the last line above with this, where n is the length of the suffix which becomes the dictionary value:
my_dict[word[:-n]] = word[-n:]
See this code running on ideone.com (reading from STDIN instead of file though...)
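To make the slicing concrete, here is a small sketch of how `word[:-n]` and `word[-n:]` behave for different values of n. The helper `split_suffix` and the read `ATTCT` are only illustrative. Note the last case: a token that is no longer than n leaves an empty prefix, so it ends up under the key ''. That would be consistent with the "'': '5'" entry in your output if the file contains a line whose last word is just 5, and with "2407 : 9" if some line ends in 24079, but that is an assumption about your file that I have not verified.

```python
def split_suffix(word, n):
    """Return (prefix, suffix), where suffix is the last n characters of word."""
    # n should be at least 1: with n = 0, word[:-0] is '' and word[-0:] is the whole word.
    return word[:-n], word[-n:]

read = "ATTCT"  # hypothetical read, not taken from the asker's file

print(split_suffix(read, 1))  # ('ATTC', 'T')
print(split_suffix(read, 2))  # ('ATT', 'CT')

# A token that is no longer than n leaves an empty prefix, i.e. the key ''.
print(split_suffix("5", 1))   # ('', '5')
```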
Update 2:
Your code as stated in the question has some problems with the indentation. Also, there are no parentheses after makeSuffixDict in the main block, but you need them to call a function. You also need to return the dictionary created in the function (return my_dict instead of return()) to be able to work with it outside.
I also now parse the 3rd whitespace-separated word in each row instead of the last one. Lines that do not contain exactly 3 words are ignored.
Here's my version:
def makeSuffixDict(inputfile="smallReads.fna", n=15):
    my_dict = {}
    with open(inputfile) as file:
        for line in file:
            words = line.split()
            if len(words) == 3:  # <-- replaced "last word" with "3rd word"
                word = words[2]
                my_dict[word[:-n]] = word[-n:]
    return my_dict

if __name__ == "__main__":
    my_dict = makeSuffixDict()
    for key, value in my_dict.items():
        print(key, value)
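The suffix length only changes if you actually call the function and pass a different n; calling it without arguments always uses the default. As a rough sketch of that, the variant below takes an iterable of lines instead of a file name so it can be tried without a file. The name `make_suffix_dict` and the sample lines are my own assumptions for illustration; the real layout of smallReads.fna may differ.

```python
import io

def make_suffix_dict(lines, n=1):
    """Map each read's prefix to its last n characters.

    Only lines with exactly three whitespace-separated words are used,
    as in the version above; everything else is skipped.
    """
    result = {}
    for line in lines:
        words = line.split()
        if len(words) == 3:
            word = words[2]
            result[word[:-n]] = word[-n:]
    return result

# Hypothetical sample data: a count line followed by two reads.
sample = io.StringIO(
    "2\n"
    "read 1 ATTCT\n"
    "read 2 CATTC\n"
)
lines = sample.readlines()

print(make_suffix_dict(lines, n=1))  # {'ATTC': 'T', 'CATT': 'C'}
print(make_suffix_dict(lines, n=2))  # {'ATT': 'CT', 'CAT': 'TC'}
```

The line containing only 2 is skipped because it does not have three words, so no '' key appears in the result.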
Upvotes: 1