Reputation: 53
I need to compare the keys (DNA_Base) and values (numbers) of a dict against the data in a CSV file and print the matching entry.
The problem is that the CSV file has three things: a person's name, a string (DNA_Base), and a number. I want to compare each DNA_Base and its corresponding number for each person, and if it matches any item of the dictionary, print the name of the person who has that specific number for that specific DNA_Base. The dict that I want to compare, STR_max, looks like this:
STR_max = {'AATG': 8 , 'TATC': 10 , 'AGATC': 9 , 'AGAG': 13}
So it should print Alice for this CSV file, and if there is no match it should print some text:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
And this is my code:
import sys
from sys import argv
import csv

# check for the correct number of command line arguments
if len(sys.argv) != 3:
    print("Usage: python dna.py data.csv sequence.txt")
    sys.exit(1)
# get the file paths from the command line arguments
csv_path = argv[1]
seq_path = argv[2]
# Opens csv file
with open(csv_path, newline='') as csvfile:
    readcsv = csv.reader(csvfile)
    # Gets access to STR names
    csv_rows = list(readcsv)
    str_names = csv_rows[0]
# Opens the DNA sequence
with open(seq_path, "r") as seqtxt:
    str_seq = seqtxt.read()
# Dict to store the count of each STR
STR_max = {}
# iterate over the STRs of the database
for str_name in str_names[1:]:
    maxCount = 0
    actualCount = 0
    str_name_len = len(str_name)
    str_seq_len = len(str_seq)
    i = 0
    found = False
    # iterate over the DNA sequence and count the str_name
    while i < str_seq_len:
        # look for the STR in the slice str_seq[i : i + str_name_len]
        find = str_seq.count(str_name, i, i + str_name_len)
        # if the 1st STR is found then start counting from it
        if find > 0 and found == False:
            actualCount = 1
            i = i + str_name_len
            found = True
        # if another STR is found right after the previous one
        elif find > 0 and found == True:
            actualCount += 1
            i = i + str_name_len
        else:
            i += 1
            found = False
        if actualCount > maxCount:
            maxCount = actualCount
    # add the STR and its maxCount to the buffer dict
    STR_max[str_name] = maxCount
Upvotes: 0
Views: 1807
Reputation: 55
Use csv.DictReader:
with open(csv_path, newline='') as csvfile:
    readcsv = csv.DictReader(csvfile)
    Found = False
    for row in readcsv:
        Found = False
        for i in range(1, len(readcsv.fieldnames)):
            # csv values are strings, so convert before comparing with the int counts
            if int(row[readcsv.fieldnames[i]]) == STR_max[readcsv.fieldnames[i]]:
                Found = True
            else:
                Found = False
                break
        if Found:
            print(row['name'])
            break
    if not Found:
        print("No match")
This compares the value in each column of the csv file (read through the DictReader), starting from the second column, with the corresponding value in the STR_max dictionary. This works because the keys of STR_max are the same as the fieldnames of the csv.
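A minimal sketch of the same idea as a self-contained function, assuming the CSV has a leading name column and integer counts in every other column. The function name find_full_match and the in-memory sample data are hypothetical, introduced only for illustration:

```python
import csv
import io

def find_full_match(csvfile, str_max):
    """Return the first name whose every STR column equals str_max, else None."""
    reader = csv.DictReader(csvfile)
    # Every column except the first ("name") is treated as an STR column.
    str_columns = reader.fieldnames[1:]
    for row in reader:
        # CSV values are strings, so convert before comparing with the int counts.
        if all(int(row[col]) == str_max.get(col) for col in str_columns):
            return row["name"]
    return None

# Hypothetical sample data mirroring the question's CSV.
sample = "name,AGATC,AATG,TATC\nAlice,2,8,3\nBob,4,1,5\nCharlie,3,2,5\n"
counts = {"AGATC": 4, "AATG": 1, "TATC": 5}
print(find_full_match(io.StringIO(sample), counts) or "No match")
```

With the STR_max from the question, no row matches in every column, because Alice matches only in the AATG column. Replacing all with any in the sketch would accept a person who matches in at least one column.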
Upvotes: 1
Reputation: 1180
What you essentially need is this: for each sequence in STR_max, look down the corresponding column and, if a value matches, return the name of the person in that row. I'm sure there are libraries that make this a lot easier, but I would just write a few helper functions that let you access the data you need.
# For each kmer in STR_max we need to know the csv column it corresponds to.
# Returns {"name": 0, "AGATC": 1, "AATG": 2, ...}
def kmer_to_column_map():
    with open( 'test.csv' ) as csv_file:
        for line in csv_file:
            # only the header row is needed, so return on the first line
            return { kmer: i for i, kmer in enumerate(line.strip().split(",")) }
# Then we need to get the column data from the csv file.
# This helper extracts a column of values from the csv file:
# it returns the list of values (as strings) from a column given an index.
def get_csv_column( column_index ):
    column = []
    with open( "test.csv" ) as csv_file:
        for i, line in enumerate(csv_file):
            if i == 0: continue  # skip the header row
            column.append( line.strip().split(",")[column_index] )
    return column
# Now let's grab what we need to iterate through the kmers and print the names
STR_max = {'AATG': 8 , 'TATC': 10 , 'AGATC': 9 , 'AGAG': 13}
kmer_map = kmer_to_column_map()
names = get_csv_column(0)
for kmer, value in STR_max.items():
    # don't print if there is no data in csv file with this kmer
    if kmer not in kmer_map: continue
    # get the column number in the csv file for this kmer
    column_index = kmer_map[kmer]
    # get the column data for this kmer - note that names and values have the same indices
    values = get_csv_column( column_index )
    # find the column values that are equal to the STR_max value - i is the index of
    # the matching value, which is also the index in names of the matching name
    matches = [ i for i, patientval in enumerate( values ) if value == int(patientval)]
    # print the name of the person for each match
    if matches:
        print( kmer )
        for match in matches:
            print( f"-> {names[match]}" )
        print()
Note that I'm opening the file each time I need a column. You could rewrite this to load the file all at once, but I know sequence data can be large. Iterating over the file object inside the "with" block reads it line by line rather than loading the entire file at once.
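If the file is small enough to hold in memory, the load-it-once variant could look roughly like the sketch below. It assumes the same CSV layout as the question; load_columns is a hypothetical helper name, and the sample string stands in for test.csv:

```python
import csv
import io

def load_columns(csvfile):
    """Read the CSV once and return {header: [values in that column]}."""
    rows = list(csv.reader(csvfile))
    header, data = rows[0], rows[1:]
    # zip(*data) transposes the rows into columns.
    return {name: list(col) for name, col in zip(header, zip(*data))}

# Hypothetical in-memory stand-in for test.csv.
sample = "name,AGATC,AATG,TATC\nAlice,2,8,3\nBob,4,1,5\nCharlie,3,2,5\n"
columns = load_columns(io.StringIO(sample))
STR_max = {"AATG": 8, "TATC": 10, "AGATC": 9, "AGAG": 13}

for kmer, value in STR_max.items():
    if kmer not in columns:
        continue  # no such column in the CSV
    for name, cell in zip(columns["name"], columns[kmer]):
        # cells are strings, so convert before comparing
        if int(cell) == value:
            print(f"{kmer} -> {name}")
```

The trade-off is memory for speed: every column is kept in a dict, so each lookup is a dictionary access instead of another pass over the file.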
In response to your question (and sorry, rereading I realize my comments were a little poor):
kmer_to_column_map returns kmer_map, and the only thing it does is take the kmers in the first row of the .csv file and return a dict that maps each kmer to its column number. For the csv file it would be something like...
kmer_map = {
    "AGATC": 1,
    "AATG": 2,
    "TATC": 3
}
Once we have the column number, we use get_csv_column to get the column. For example:
get_csv_column( 2 )
#['8', '1', '2'] <- column 2 (AATG) from csv file
or equivalently
get_csv_column( kmer_map['AATG'] )
#['8', '1', '2']
We then iterate through the keys of STR_max to determine if there is a column for that kmer in the .csv file. In this case, the last element of STR_max, "AGAG", is not in the csv file and will therefore not have any matches, so the line:
if kmer not in kmer_map: continue
skips looking for a match (and prevents a KeyError).
If the kmer is in the csv file, we need to know the column number so we can grab the values.
column_index = kmer_map[kmer] #<-returns column 2 for kmer_map['AATG']
Then we get the column of data for the specified kmer
values = get_csv_column( column_index )
# get_csv_column(2) = ['8', '1', '2']
If there is a match, it will be in values. The last thing we need is names:
names = get_csv_column( 0 )
#Now we have two lists where the index of Alice is the same as the index of Alice's value for that kmer.
#for kmer="AATG"
#values: [ '8', '1', '2' ]
#names: [ 'Alice', 'Bob', 'Charlie' ]
Lastly, we check values for any entries equal to the STR_max value and, if there are any, find the index of each match and print names[index], which is the answer you're after. I put this logic on one line using a list comprehension.
matches = [ i for i, patientval in enumerate( values ) if value == int(patientval)]
which could be rewritten like the following for clarity:
matches = []
for i, patientval in enumerate( values ):
    if int(patientval) == value:
        matches.append( i )
For the case of "AATG", with an STR_max value of 8, it matches the values ['8', '1', '2'] (after int conversion) at index 0, and the name at index 0 in names is Alice.
Hope that helps
Upvotes: 1
Reputation: 7806
Take time out to first watch a video about looping; this will save you time and sanity in the future: https://www.youtube.com/watch?v=EnSu9hHGq5o
It looks like this is a toy problem. You should get familiar with the data structure called a "trie": it will help you search one DNA sequence for several strings at the same time, rather than using this triple nested loop, which will run in N^3 time.
Here is an example: https://towardsdatascience.com/implementing-a-trie-data-structure-in-python-in-less-than-100-lines-of-code-a877ea23c1a1
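To make the trie suggestion concrete, here is a rough sketch (not taken from the linked article) that counts the longest back-to-back run of each STR in a single pass over the sequence. The names build_trie, patterns_at and longest_runs are hypothetical, and the sketch assumes the sequence never contains the "$" character, which is used as an end-of-pattern marker:

```python
def build_trie(patterns):
    """Build a nested-dict trie; the key "$" marks the end of a pattern."""
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node["$"] = pattern
    return root

def patterns_at(trie, text, start):
    """Yield every pattern that occurs in text beginning at index start."""
    node = trie
    i = start
    while i < len(text) and text[i] in node:
        node = node[text[i]]
        i += 1
        if "$" in node:
            yield node["$"]

def longest_runs(sequence, patterns):
    """Return {pattern: longest number of back-to-back repeats in sequence}."""
    trie = build_trie(patterns)
    best = {p: 0 for p in patterns}
    # run_ending[(i, p)] = length of the run of p that ends just before index i
    run_ending = {}
    for i in range(len(sequence)):
        for p in patterns_at(trie, sequence, i):
            run = run_ending.pop((i, p), 0) + 1
            run_ending[(i + len(p), p)] = run
            best[p] = max(best[p], run)
    return best

print(longest_runs("AGATCAGATCAATGTATCTATCTATC", ["AGATC", "AATG", "TATC"]))
```

At each position the trie is walked once for all STRs together, instead of scanning the sequence separately for each STR. The run_ending dict is never pruned, which is acceptable for a sketch but worth tidying for very long sequences.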
Upvotes: 1