s_boardman

Reputation: 416

Matching values in nested dictionaries

I have two dictionaries which contain nested sub-dictionaries. They are structured as follows:

search_regions = {
    'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
    'chr13:27185654-39682032': {'Chr': 'chr13', 'End': 39682032, 'Start': 27185654}
}

database_variants = {
    'chr11:56694718-56694718': {'Chr': 'chr11', 'End': 56694718, 'Start': 56694718},
    'chr13:27185659-27185659': {'Chr': 'chr13', 'End': 27185659, 'Start': 27185659}
}

I need to compare them and pull out the dictionaries in database_variants that fall within the range of any dictionary in search_regions.

I am building a function to do this (linked to a previous question). This is what I have so far:

def region_to_variant_location_match(search_Variants, database_Variants):
    '''Take dictionaries for search_Variants and database_Variants as input.
    Match variants in database_Variants to regions within search_Variants.
    Return matches as a nested dictionary.'''
    # Match on the Chr value,
    # where the Start value from database_Variants lies between the Start
    # and End values in search_Variants.
    # Return the matches as a nested dictionary.

The problem I am having is working out how to access the values in the nested dictionaries (Chr, Start, End, etc.) for the comparison. I'd like to do this with a list comprehension, as I have quite a lot of data to get through and I suspect a plain for loop would be slower.

Any help is much appreciated!

UPDATE

I've tried to implement the solution suggested by bioinfoboy below. My first step was to convert the search_regions and database_variants dictionaries into defaultdict(list) using the following functions:

def search_region_converter(searchDict):
    '''This function takes the dictionary of dictionaries and converts it to a
    defaultdict(list) to allow matching with the database in a corresponding
    format.'''
    search_regions = defaultdict(list)
    for i in search_regions.keys():
        chromosome = i.split(":")[0]
        start = int(i.split(":")[1].split("-")[0])
        end = int(i.split(":")[1].split("-")[1])
        search_regions[chromosome].append((start, end))
    return search_regions #a list with chromosomes as keys 

def database_snps_converter(databaseDict):
    '''This function takes the dictionary of dictionaries and converts it to a
    defaultdict(list) to allow matching with the search regions in a
    corresponding format.'''
    database_variants = defaultdict(list)
    for i in database_variants.keys():
        chromosome = i.split(":")[0]
        start = int(i.split(":")[1].split("-")[0])
        database_variants[chromosome].append(start)
    return database_variants #list of database variants 

Then I have made a function for matching (again with bioinfoboy's code), which is as follows:

def region_to_variant_location_match(search_Regions, database_Variants):
    '''Take dictionaries for search_Regions and database_Variants as input.
    Match variants in database_Variants to regions within search_Regions.'''
    for key, values in database_Variants.items():
        for value in values:
            for search_area in search_Regions[key]:
                print(search_area)
                if (value >= search_area[0]) and (value <= search_area[1]):
                    yield(key, search_area)

However, the two converter functions return empty defaultdicts, and I can't quite work out what I need to change.

Any ideas?

Upvotes: 1

Views: 2209

Answers (2)

desfido

Reputation: 787

You should probably do something like:

def region_to_variant_location_match(search_Variants, database_Variants):
    '''Take dictionaries for search_Variants and database_Variants as input.
    Match variants in database_Variants to regions within search_Variants.
    Return matches as a nested dictionary.'''
    return {
        record[0]: record[1]
        for record, lookup in zip(
            database_Variants.items(),
            search_Variants.items()
        )
        if (
            record[1]['Chr'] == lookup[1]['Chr'] and 
            lookup[1]['Start'] <= record[1]['Start'] <= lookup[1]['End']
        )
    }

Note that if you were using Python 2 (instead of Python 3), you would use iteritems() instead of items() and itertools.izip() instead of zip. On versions older than 2.7, which do not have dict comprehensions, you would need to pass a generator expression to dict() instead.
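One caveat with zip: it pairs the first variant with the first region, the second with the second, and so on, so the approach above only works when the two dictionaries line up entry for entry. If that is not guaranteed, a minimal sketch that checks every variant against every region might look like the following. The function name match_variants_to_regions is hypothetical and chosen only for illustration; the 'Chr', 'Start' and 'End' keys and the sample data are taken from the question.

```python
def match_variants_to_regions(search_regions, database_variants):
    """Return the variants whose Start lies inside any search region
    on the same chromosome, as a nested dictionary.

    Hypothetical helper for illustration; assumes every inner dict has
    'Chr', 'Start' and 'End' keys, as in the question's sample data.
    """
    return {
        name: variant
        for name, variant in database_variants.items()
        # any() stops at the first region that contains the variant
        if any(
            variant['Chr'] == region['Chr']
            and region['Start'] <= variant['Start'] <= region['End']
            for region in search_regions.values()
        )
    }


# Sample data copied from the question
search_regions = {
    'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
    'chr13:27185654-39682032': {'Chr': 'chr13', 'End': 39682032, 'Start': 27185654},
}
database_variants = {
    'chr11:56694718-56694718': {'Chr': 'chr11', 'End': 56694718, 'Start': 56694718},
    'chr13:27185659-27185659': {'Chr': 'chr13', 'End': 27185659, 'Start': 27185659},
}

matches = match_variants_to_regions(search_regions, database_variants)
```

This compares each variant with each region, so the work grows with the product of the two sizes. For large inputs, grouping the regions by chromosome first should cut down the number of comparisons.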

Upvotes: 1

bioinfoboy

Reputation: 132

I imagine this may help.

I'm converting your search_regions and database_variants into the format I mentioned in the comment.

from collections import defaultdict

# Group variant start positions by chromosome
_database_variants = defaultdict(list)
for i in database_variants.keys():
    _chromosome = i.split(":")[0]
    _start = int(i.split(":")[1].split("-")[0])
    _database_variants[_chromosome].append(_start)

# Group (start, end) region tuples by chromosome
_search_regions = defaultdict(list)
for i in search_regions.keys():
    _chromosome = i.split(":")[0]
    _start = int(i.split(":")[1].split("-")[0])
    _end = int(i.split(":")[1].split("-")[1])
    _search_regions[_chromosome].append((_start, _end))

def _search(_database_variants, _search_regions):
    # Yield (chromosome, region) for every variant that falls inside a region
    for key, values in _database_variants.items():
        for value in values:
            for search_area in _search_regions[key]:
                if search_area[0] <= value <= search_area[1]:
                    yield key, search_area

I've used yield, so calling the function returns a generator object that you can iterate over. With the data you provided in the question, it can be run like this:

for i in _search(_database_variants, _search_regions):
    print(i)

The output is the following:

('chr11', (56694718, 71838208))
('chr13', (27185654, 39682032))

Is that not what you are trying to achieve?
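Regarding the update in the question: both converter functions there create an empty defaultdict and then loop over that same empty object instead of the dictionary passed in, which would explain why they return nothing. A minimal sketch of the conversion wrapped in functions, assuming the keys always have the form 'chr:start-end' as in the sample data (the helper name parse_key is hypothetical):

```python
from collections import defaultdict


def parse_key(key):
    """Split a 'chr:start-end' key into (chromosome, start, end).

    Hypothetical helper; assumes exactly one ':' and one '-' in the key.
    """
    chromosome, span = key.split(":")
    start, end = span.split("-")
    return chromosome, int(start), int(end)


def search_region_converter(search_dict):
    """Group (start, end) tuples by chromosome."""
    regions = defaultdict(list)
    for key in search_dict:  # iterate the argument, not the new defaultdict
        chromosome, start, end = parse_key(key)
        regions[chromosome].append((start, end))
    return regions


def database_snps_converter(database_dict):
    """Group variant start positions by chromosome."""
    variants = defaultdict(list)
    for key in database_dict:  # same fix: loop over the input dictionary
        chromosome, start, _end = parse_key(key)
        variants[chromosome].append(start)
    return variants
```

The two results can then be passed to the matching function, for example _search(database_snps_converter(database_variants), search_region_converter(search_regions)).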

Upvotes: 1
