Reputation: 53

Compare csv file data to a dictionary items in Python

I need to compare the keys (DNA_Base) and values (number) of a dict to some data from a csv file and print the matching one. The problem is that the csv file has three things: a person's name, a string (DNA_Base), and a number. I want to compare each DNA_Base and its corresponding number for the given persons, and if it matches any item of the dictionary, print the name of the person who has this specific number for this specific DNA_Base. The dict that I want to compare, STR_max, should look like this:

STR_max = {'AATG': 8 , 'TATC': 10 , 'AGATC': 9 , 'AGAG': 13}

So it should print Alice for this csv file, and if there is no match it should print some text:

name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5

And this is my code:

import sys
from sys import argv
import csv

#check correct command line argument
if len(sys.argv) != 3:
    print("Usage: python dna.py data.csv sequence.txt")
    exit(1)

#get the file path from the command line argument
csv_path = argv[1]
seq_path = argv[2]


# Opens csv file
with open(csv_path, newline='') as csvfile:
    readcsv = csv.reader(csvfile)

    # Gets access to STR names
    csv_rows = list(readcsv)
    str_names = csv_rows[0]

# Opens the DNA sequence
with open(seq_path, "r") as seqtxt:
    str_seq = seqtxt.read()

#Dict to store the counting of str
STR_max = {}

#iterate over the STR of the database
for str_name in str_names[1:]:
    maxCount = 0
    actualCount = 0
    str_name_len = len(str_name)
    str_seq_len = len(str_seq)
    i = 0
    found = False

    #iterate over the DNA Seq and count the str_name
    while i < str_seq_len:
        #find the STR in range of str_seq[i : i+str_name_len]
        find = str_seq.count(str_name, i, i + str_name_len)

        #if the 1st STR found then start counting from it
        if find > 0 and found == False:
            actualCount = 1
            i = i + str_name_len
            found = True

        #if another STR is found again next to the previous one
        elif find > 0 and found == True:
            actualCount += 1
            i = i + str_name_len

        else:
            i += 1
            found = False

        if actualCount > maxCount:
            maxCount = actualCount


    #adding the STR and its maxCount to a buffer dict
    STR_max[str_name] = maxCount

Upvotes: 0

Answers (3)

Vidarshana Dissanayaka

Reputation: 55

Use csv.DictReader

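with open(csv_path, newline='') as csvfile:
    readcsv = csv.DictReader(csvfile)
    Found = False
    for row in readcsv:
        Found = False
        # compare every STR column (skipping the first column, 'name')
        for i in range(1, len(readcsv.fieldnames)):
            # csv values are strings, so convert before comparing with the int in STR_max
            if int(row[readcsv.fieldnames[i]]) == STR_max[readcsv.fieldnames[i]]:
                Found = True
            else:
                Found = False
                break
        if Found:
            print(row['name'])
            break
    if not Found:
        print("No match")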

This compares the value in each column of the csv file (starting from the second column) with the value stored in the STR_max dictionary under the same key. This works because the keys of STR_max are the same as the fieldnames of the csv.
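As a self-contained illustration of the same idea, here is a minimal sketch. It assumes the sample data from the question, uses io.StringIO in place of the real file, and introduces a hypothetical helper, find_full_match, which is not part of the csv module. The counts in example_counts are made up so that Alice matches on every column.

```python
import csv
import io

# Sample data copied from the question; io.StringIO stands in for the real file.
SAMPLE_CSV = """name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
"""

def find_full_match(csvfile, str_max):
    """Return the first name whose every STR column equals str_max, else None.

    Hypothetical helper written for this example only.
    """
    reader = csv.DictReader(csvfile)
    str_columns = reader.fieldnames[1:]  # skip the 'name' column
    for row in reader:
        # csv values are strings, so convert before comparing with the int counts
        if all(int(row[col]) == str_max.get(col) for col in str_columns):
            return row["name"]
    return None

# Illustrative counts (an assumption) chosen so that Alice matches every column.
example_counts = {"AGATC": 2, "AATG": 8, "TATC": 3}
match = find_full_match(io.StringIO(SAMPLE_CSV), example_counts)
print(match if match is not None else "No match")
```

Using str_max.get(col) rather than str_max[col] means a csv column that is missing from the dict simply fails to match instead of raising a KeyError.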

Upvotes: 1

sin tribu

Reputation: 1180

What you essentially need is, for each sequence in STR_max, to look down the matching column and, if a value matches, return the name of the person at that row. I'm sure there are libraries that make this a lot easier, but I would just write a few helper functions that let you access the data you need.

#for each kmer in STR_max we need to know the csv column it corresponds to
# returns { "name": 0, "AGATC": 1, "AATG": 2,...}
def kmer_to_column_map():
    with open( 'test.csv' ) as csvfile:
        # only the header line is needed, so return on the first iteration
        for line in csvfile:
            return { kmer: i for i, kmer in enumerate(line.strip().split(",")) }

#Then we need to get the column data from the csv file.
#This helper returns the list of values (as strings) from a column given its index
def get_csv_column( column_index ):
    column = []
    with open( "test.csv" ) as csv:
        for i, line in enumerate(csv):
            if i == 0: continue 
            column.append( line.strip().split(",")[column_index] )
    return column

#now let's grab what we need to iterate through kmers and print the names
STR_max = {'AATG': 8 , 'TATC': 10 , 'AGATC': 9 , 'AGAG': 13}
kmer_map = kmer_to_column_map()
names = get_csv_column(0)

for kmer, value in STR_max.items():
    # don't print if there is no data in csv file with this kmer
    if kmer not in kmer_map: continue 
    
    #get the column number in the csv file for this kmer 
    column_index = kmer_map[kmer]  
    
    #get the column data for this kmer - note that names and values have the same indices.
    values       = get_csv_column( column_index )
    
    #find the column values that are equal to STR_max value - i is the index of the value that matches which will be the index in names of the name that matches
    matches = [ i for i, patientval in enumerate( values ) if value == int(patientval)]

    #print the name of the person for each match - 
    if matches:
        print( kmer )
        for match in matches:
            print( f"-> {names[match]}" )
        print()

Note that I'm opening the file each time I need a column. You could rewrite this to load the file all at once, but I know sequence data can be large. Iterating over the open file object reads it line by line without loading the entire file into memory, and the "with" keyword makes sure the file is closed afterwards.
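If the csv file is small enough to hold in memory, the single-load variant could look roughly like the sketch below. It assumes the sample data from the question, uses io.StringIO in place of test.csv, and relies on csv.DictReader instead of splitting lines by hand. The helper name names_matching_each_kmer is hypothetical.

```python
import csv
import io

# Sample data copied from the question; io.StringIO stands in for test.csv.
SAMPLE_CSV = """name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
"""

def names_matching_each_kmer(csvfile, str_max):
    """Map each kmer to the names whose count equals str_max[kmer].

    Hypothetical helper: reads the whole file once instead of once per column.
    """
    rows = list(csv.DictReader(csvfile))  # every row as a dict keyed by header
    result = {}
    for kmer, value in str_max.items():
        # kmers without a csv column (such as AGAG) are skipped by the 'in' test
        names = [row["name"] for row in rows
                 if kmer in row and int(row[kmer]) == value]
        if names:
            result[kmer] = names
    return result

STR_max = {'AATG': 8, 'TATC': 10, 'AGATC': 9, 'AGAG': 13}
matches_by_kmer = names_matching_each_kmer(io.StringIO(SAMPLE_CSV), STR_max)
for kmer, names in matches_by_kmer.items():
    print(kmer)
    for name in names:
        print(f"-> {name}")
```

The trade-off is memory for speed: the file is parsed once, but every row stays in memory for the duration of the search.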

UPDATED

In response to your question (and sorry, on rereading I realize my comments were a little poor):

kmer_to_column_map returns kmer_map, and the only thing it does is take the kmers in the first row of the .csv file and return a dict with the kmers as keys and their column numbers as values. For the csv file, leaving out the "name": 0 entry for the first column, it would be something like...

kmer_map = {
    "AGATC": 1,
    "AATG": 2,
    "TATC": 3 
}

Once we have the column number, we use get_csv_column to get the column. For example:

get_csv_column( 2 )
#['8', '1', '2'] <- the AATG column (index 2) from the csv file; values are read as strings

or equivalently

get_csv_column( kmer_map['AATG'] )
#['8', '1', '2']

We then iterate through the keys of STR_max to determine whether there is a column for that kmer in the .csv file. In this case, the last element of STR_max, "AGAG", is not in the csv file and therefore cannot match any patient, so the line:

if kmer not in kmer_map: continue

skips looking for a match (and prevents a KeyError).

If the kmer is in the csv file, we need to know the column number so we can grab the values.

column_index = kmer_map[kmer] #<- 2 when kmer is 'AATG'

Then we get the column of data for the specified kmer:

values       = get_csv_column( column_index )
# get_csv_column(2) = ['8', '1', '2']

If there is a match, it will be in values. The last thing we need is names:

names = get_csv_column( 0 )

#Now we have two lists where the index of Alice is the same as the index of Alice's value for that kmer.
#for kmer="AATG"
#values: [ '8', '1', '2' ]
#names:  [ 'Alice', 'Bob', 'Charlie' ]

Lastly, we check values for any matches with the STR_max value and, if there are any, find the index of each match and print names[index], which is the answer you're after. I put this logic on one line using a list comprehension.

matches = [ i for i, patientval in enumerate( values ) if value == int(patientval)]

This could be rewritten like the following for clarity:

matches = []
for i, patientval in enumerate( values ):
    if int(patientval) == value:
        matches.append( i )

For the case of "AATG", with an STR_max value of 8, it matches values ['8', '1', '2'] at index 0, and the name at index 0 in names is Alice.

Hope that helps

Upvotes: 1

Back2Basics

Reputation: 7806

Take time out to first watch a video about looping; this will save you time and sanity in the future: https://www.youtube.com/watch?v=EnSu9hHGq5o

It looks like this is a toy problem. You should get familiar with the data structure called a "trie"; it will help you search one DNA sequence for several strings at the same time, rather than using this triple-nested loop, which will run in N^3 time.

Here is an example: https://towardsdatascience.com/implementing-a-trie-data-structure-in-python-in-less-than-100-lines-of-code-a877ea23c1a1
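To give a rough idea of what that looks like, here is a minimal sketch of a trie built from nested dicts. It is not taken from the linked article; the function names build_trie and patterns_starting_at are hypothetical, and the sequence is made up. It only reports which STRs start at each position; counting the longest consecutive run would still be a separate step on top of this.

```python
def build_trie(patterns):
    """Build a nested-dict trie; the key None marks the end of a pattern."""
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[None] = pattern  # None can never collide with a sequence character
    return root

def patterns_starting_at(trie, sequence, start):
    """Return every stored pattern that begins at sequence[start]."""
    found = []
    node = trie
    i = start
    # walk down the trie for as long as the sequence keeps matching
    while i < len(sequence) and sequence[i] in node:
        node = node[sequence[i]]
        if None in node:
            found.append(node[None])
        i += 1
    return found

# STR names from the question's csv header; the sequence is a made-up example.
trie = build_trie(["AGATC", "AATG", "TATC"])
sequence = "AATGAATGTATC"
hits = {}
for i in range(len(sequence)):
    matched = patterns_starting_at(trie, sequence, i)
    if matched:
        hits[i] = matched
print(hits)
```

Each position of the sequence is checked against all STRs in a single walk down the trie, so the work per position is bounded by the length of the longest STR rather than by the number of STRs.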

Upvotes: 1
