CuriousDude

Reputation: 1117

Why are the values in my dictionary returning as a list within a list for each element?

I've got a file with an id and lineage info for species.

For example:

162,Bacteria,Spirochaetes,Treponemataceae,Spirochaetia,Treponema
174,Bacteria,Spirochaetes,Leptospiraceae,Spirochaetia,Leptospira
192,Bacteria,Proteobacteria,Azospirillaceae,Alphaproteobacteria,Azospirillum
195,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
197,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
199,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
201,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
2829358,,,,,
2806529,Eukaryota,Nematoda,,,

I'm writing a script where I need to get the counts for each lineage depending on user input (i.e. if genus, then I would be looking at the last word in each line such as Treponema, if class, then the fourth, etc).

I'll later need to convert the counts into a dataframe but first I am trying to turn this file of lineage info into a dictionary where depending on user input, that lineage info (i.e. let's say genus) is the key, and the id is the value. This is because there can be multiple ids that match to the same lineage info such as ids 195, 197, 199, 201 would all return a hit for Campylobacter.

Here is my code:

def create_dicts(filename):
        '''Transforms the unique_taxids_lineage_allsamples file into a dictionary.
        Note: There can be multiple ids mapping to the same lineage info. Therefore ids should be values.'''
        # Creating a genus_dict
        unique_ids_dict={} # the main dict to return
        phylum_dict=2 # third item in line
        family_dict=3 # fourth item in line
        class_dict=4 # fifth item in line
        genus_dict=5 # sixth item in line
        type_of_dict=input("What type of dict? 2=phylum, 3=family, 4=class, 5=genus\n")

        with open(filename, 'r') as f:
                content = f.readlines()

        for line in content:
                key = line.split(",")[int(type_of_dict)].strip("\n") # lineage info
                value = line.split(",")[0].split("\n") # the id, there can be multiple mapping to the same key
                if key in unique_ids_dict:  # if the lineage info is already a key, skip
                        unique_ids_dict[key].append(value)
                else:
                        unique_ids_dict[key]=value
        return unique_ids_dict

I had to add the .split("\n") at the end of value because I kept getting the error where str object doesn't have attribute append.

I am trying to get a dictionary like the following if the user input was 5 for genus:

unique_ids_dict={'Treponema': ['162'], 'Leptospira': ['174'], 'Azospirillum': ['192'], 'Campylobacter': ['195', '197', '199', '201'], '': ['2829358', '2806529']}

But instead I am getting the following:

unique_ids_dict={'Treponema': ['162'], 'Leptospira': ['174'], 'Azospirillum': ['192'], 'Campylobacter': ['195', ['197'], ['199'], ['201']], '': ['2829358', ['2806529']]} ##missing str "NONE" haven't figured out how to convert empty strings to say "NONE"

Also, if anyone knows how to convert all empty hits into "NONE" or something similar, that would be great. This is a secondary question, so if needed I can open it as a separate question.

Thank you!

SOLVED: I needed to use extend instead of append.
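
For anyone landing here later, a minimal illustration of the difference between the two list methods, using ids from the example file above. The variable names are only for illustration:

```python
# append adds its argument as ONE element, even if that argument is a list.
ids_append = ['195']
ids_append.append(['197'])
print(ids_append)   # ['195', ['197']]

# extend iterates over its argument and adds each item separately.
ids_extend = ['195']
ids_extend.extend(['197'])
print(ids_extend)   # ['195', '197']
```

Since value in my code is already a one-item list (because of the .split("\n")), append nests it while extend merges it.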

To change the empty-string key into "NONE" I used dict.pop, placed after my if statement:

unique_ids_dict["NONE"] = unique_ids_dict.pop("")
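
One caveat, as a hedged sketch: pop("") raises KeyError when the dictionary has no empty-string key (for example, if every line has a genus), so a guard may be worth adding. The sample dictionary below is made up from the ids in the question:

```python
unique_ids_dict = {'Treponema': ['162'], '': ['2829358', '2806529']}

# Only rename the key if an empty-string key actually exists.
if "" in unique_ids_dict:
    unique_ids_dict["NONE"] = unique_ids_dict.pop("")

print(unique_ids_dict)
# {'Treponema': ['162'], 'NONE': ['2829358', '2806529']}
```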

Thank you!

Upvotes: 0

Views: 69

Answers (2)

Zander Augusto

Reputation: 16

I suggest using pandas: it is much simpler, and it also lets you give the columns explicit header names:

import pandas as pd


def create_dicts(filename):
        """
        Transforms the unique_taxids_lineage_allsamples file into a 
        dictionary.
        
        Note: There can be multiple ids mapping to the same lineage info. 
        Therefore ids should be values.
        """
        
        # Reading File:
        content = pd.read_csv(
            filename,
            names=("ID", "Kingdom", "Phylum", "Family", "Class", "Genus")
        )

        # Printing input and choosing clade to work with:
        print("\nWhat type of dict?")
        print("- Phylum")
        print("- Family")
        print("- Class")
        print("- Genus")
        
        clade = input("> ").capitalize()

        # Replacing empty values with string 'None':
        content = content.where(pd.notnull(content), "None")
        
        # Selecting columns and aggregating accordingly to the chosen
        # clade and ID:
        series = content.groupby(clade)["ID"].unique()

        # Creating dict:
        content_dict = series.to_dict()

        # If you do not want to work with Numpy arrays, just create
        # another dict of lists:
        content_dict = {k:list(v) for k, v in content_dict.items()}

        return content_dict


if __name__ == "__main__":
    
    d = create_dicts("temp.csv")

    print(d)

temp.csv:

162,Bacteria,Spirochaetes,Treponemataceae,Spirochaetia,Treponema
174,Bacteria,Spirochaetes,Leptospiraceae,Spirochaetia,Leptospira
192,Bacteria,Proteobacteria,Azospirillaceae,Alphaproteobacteria,Azospirillum
195,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
197,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
199,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
201,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter
829358,,,,,
2806529,Eukaryota,Nematoda,,,

I hope this is what you wanted to do.

Upvotes: 0

Michal

Reputation: 162

def create_dicts(filename):
    '''Transforms the unique_taxids_lineage_allsamples file into a dictionary.
    Note: There can be multiple ids mapping to the same lineage info. Therefore ids should be values.'''
    # Creating a genus_dict
    unique_ids_dict = {}  # the main dict to return
    phylum_dict = 2  # third item in line
    family_dict = 3  # fourth item in line
    class_dict = 4  # fifth item in line
    genus_dict = 5  # sixth item in line
    type_of_dict = input("What type of dict? 2=phylum, 3=family, 4=class, 5=genus\n")

    with open(filename, 'r') as f:
        content = f.readlines()

    for line in content:
        key = line.split(",")[int(type_of_dict)].strip("\n")  # lineage info
        value = line.split(",")[0].split("\n")  # the id, wrapped in a one-item list
        if key in unique_ids_dict:  # lineage info already a key: add this id to its list
            unique_ids_dict[key].extend(value)  # extend, not append
        else:
            unique_ids_dict[key] = value
    return unique_ids_dict

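This worked for me: use extend on the list, not append. Because value is itself a list, append adds it as a single nested element, while extend adds its items.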
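
As an alternative sketch (not the asker's code), the same grouping can be done with the standard csv module and dict.setdefault, keeping each id as a plain string so that append is the right call. The function name group_ids_by_level, the LEVELS mapping, and the missing parameter are hypothetical names introduced here; the column indices come from the prompt text in the question, and the level is passed as an argument instead of being read with input():

```python
import csv
import io

# Column positions as listed in the question's prompt (assumed fixed).
LEVELS = {"phylum": 2, "family": 3, "class": 4, "genus": 5}

def group_ids_by_level(lines, level, missing="NONE"):
    """Map each lineage name at `level` to the list of ids that carry it."""
    column = LEVELS[level]
    grouped = {}
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        key = row[column] or missing  # empty field becomes the placeholder label
        grouped.setdefault(key, []).append(row[0])  # id is a plain string here
    return grouped

# A few lines from the question's example file, held in memory for the demo.
sample = io.StringIO(
    "195,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter\n"
    "197,Bacteria,Proteobacteria,Campylobacteraceae,Epsilonproteobacteria,Campylobacter\n"
    "2829358,,,,,\n"
    "2806529,Eukaryota,Nematoda,,,\n"
)
by_genus = group_ids_by_level(sample, "genus")
counts = {name: len(ids) for name, ids in by_genus.items()}
print(by_genus)  # {'Campylobacter': ['195', '197'], 'NONE': ['2829358', '2806529']}
print(counts)    # {'Campylobacter': 2, 'NONE': 2}
```

With a real file, an open file object can be passed in place of the StringIO. The counts dictionary is one possible starting point for the per-lineage counts mentioned in the question.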

Upvotes: 2
