Reputation: 416
I have two dictionaries which contain nested sub-dictionaries. They are structured as follows:
search_regions = {
    'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
    'chr13:27185654-39682032': {'Chr': 'chr13', 'End': 39682032, 'Start': 27185654}
}
database_variants = {
    'chr11:56694718-56694718': {'Chr': 'chr11', 'End': 56694718, 'Start': 56694718},
    'chr13:27185659-27185659': {'Chr': 'chr13', 'End': 27185659, 'Start': 27185659}
}
I need to compare them and pull out the dictionaries in database_variants which fall in the range of the dictionaries in search_regions.
I am building a function to do this (linked to a previous question). This is what I have so far:
def region_to_variant_location_match(search_Variants, database_Variants):
    '''Take dictionaries for search_Variants and database_Variants as input.
    Match variants in database_Variants to regions within search_Variants.
    Return matches as a nested dictionary.'''
    # Match on Chr value
    # Where Start value from database_Variants is between Start and End values in search_Variants
    # Return as nested dictionary
The problem I am having is working out how to get to the values in the nested dictionaries (Chr, Start, End, etc.) for the comparison. I'd like to do this with a comprehension, as I've got quite a bit of data to get through and I suspect a plain for loop might be more time consuming.
Any help is much appreciated!
UPDATE
I've tried to implement the solution suggested by bioinfoboy below. My first step was to convert the search_regions and database_variants dictionaries into defaultdict(list) using the following functions:
def search_region_converter(searchDict):
    '''This function takes the dictionary of dictionaries and converts it to a
    DefaultDict(list) to allow matching
    with the database in a corresponding format'''
    search_regions = defaultdict(list)
    for i in search_regions.keys():
        chromosome = i.split(":")[0]
        start = int(i.split(":")[1].split("-")[0])
        end = int(i.split(":")[1].split("-")[1])
        search_regions[chromosome].append((start, end))
    return search_regions #a list with chromosomes as keys
def database_snps_converter(databaseDict):
    '''This function takes the dictionary of dictionaries and converts it to a
    DefaultDict(list) to allow matching
    with the search_snps in a corresponding format'''
    database_variants = defaultdict(list)
    for i in database_variants.keys():
        chromosome = i.split(":")[0]
        start = int(i.split(":")[1].split("-")[0])
        database_variants[chromosome].append(start)
    return database_variants #list of database variants
Then I have made a function for matching (again with bioinfoboy's code), which is as follows:
def region_to_variant_location_match(search_Regions, database_Variants):
    '''Take dictionaries for search_Variants and database_Variants as input.
    Match variants in database_Variants to regions within search_Variants.'''
    for key, values in database_Variants.items():
        for value in values:
            for search_area in search_Regions[key]:
                print(search_area)
                if (value >= search_area[0]) and (value <= search_area[1]):
                    yield(key, search_area)
However the defaultdict functions return empty dictionaries and I can't quite work out what I need to change.
Any ideas?
Upvotes: 1
Views: 2209
Reputation: 787
You should probably do something like
def region_to_variant_location_match(search_Variants, database_Variants):
    '''Take dictionaries for search_Variants and database_Variants as input.
    Match variants in database_Variants to regions within search_Variants.
    Return matches as a nested dictionary.'''
    return {
        record[0]: record[1]
        for record, lookup in zip(
            database_Variants.items(),
            search_Variants.items()
        )
        if (
            record[1]['Chr'] == lookup[1]['Chr'] and
            lookup[1]['Start'] <= record[1]['Start'] <= lookup[1]['End']
        )
    }
Note that if you were using Python 2.7 or lower (instead of Python 3), you would use iteritems() instead of items() and itertools.izip() instead of zip, and if you were using a version older than 2.7, you would need to pass a generator expression to dict() instead of using a dict comprehension.
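One caveat with the comprehension above: zip pairs the n-th variant with the n-th region, so a match is only found when both dictionaries happen to be ordered in parallel. Below is a minimal sketch that checks every variant against every region instead, assuming the same 'Chr', 'Start' and 'End' keys as in the question. The function name match_variants_to_regions and the extra 'chr2:100-100' entry are made up for illustration.

```python
def match_variants_to_regions(search_regions, database_variants):
    """Return the database_variants entries that lie inside any search region."""
    return {
        name: variant
        for name, variant in database_variants.items()
        # Keep the variant if at least one region on the same chromosome contains it.
        if any(
            variant['Chr'] == region['Chr']
            and region['Start'] <= variant['Start'] <= region['End']
            for region in search_regions.values()
        )
    }


search_regions = {
    'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
    'chr13:27185654-39682032': {'Chr': 'chr13', 'End': 39682032, 'Start': 27185654},
}
# Same variants as in the question, deliberately listed in a different order,
# plus one made-up variant that lies outside every region.
database_variants = {
    'chr13:27185659-27185659': {'Chr': 'chr13', 'End': 27185659, 'Start': 27185659},
    'chr11:56694718-56694718': {'Chr': 'chr11', 'End': 56694718, 'Start': 56694718},
    'chr2:100-100': {'Chr': 'chr2', 'End': 100, 'Start': 100},
}

matches = match_variants_to_regions(search_regions, database_variants)
print(sorted(matches))
```

This compares each variant with each region, so the cost grows with the product of the two sizes. For large inputs, grouping the regions by chromosome first would cut down the number of comparisons.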
Upvotes: 1
Reputation: 132
I imagine this may help.
I'm converting your search_regions and database_variants according to what I've mentioned in the comment.
from collections import defaultdict

# Group variant start positions by chromosome.
_database_variants = defaultdict(list)
for i in database_variants.keys():
    _chromosome = i.split(":")[0]
    _start = int(i.split(":")[1].split("-")[0])
    _database_variants[_chromosome].append(_start)

# Group (start, end) search regions by chromosome.
_search_regions = defaultdict(list)
for i in search_regions.keys():
    _chromosome = i.split(":")[0]
    _start = int(i.split(":")[1].split("-")[0])
    _end = int(i.split(":")[1].split("-")[1])
    _search_regions[_chromosome].append((_start, _end))

def _search(_database_variants, _search_regions):
    # Yield (chromosome, region) for every variant position inside a region.
    for key, values in _database_variants.items():
        for value in values:
            for search_area in _search_regions[key]:
                if (value >= search_area[0]) and (value <= search_area[1]):
                    yield (key, search_area)
I've used yield, so _search returns a generator object that you can iterate over. With the data you provided initially in the question, I get the following output.
for i in _search(_database_variants, _search_regions):
    print(i)
The output is the following:
('chr11', (56694718, 71838208))
('chr13', (27185654, 39682032))
Is that not what you are trying to achieve?
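Regarding the converter functions in the question's update: a likely reason they come back empty is that each one creates a fresh defaultdict named search_regions (or database_variants) and then loops over that new, empty object, while the argument (searchDict or databaseDict) is never read. Here is a minimal sketch of one possible fix. The names search_dict, database_dict, regions and variants are my own choices, and the empty {} values in the usage lines are placeholders, since only the keys are parsed.

```python
from collections import defaultdict


def search_region_converter(search_dict):
    """Group (start, end) tuples by chromosome, parsed from keys like 'chr11:100-200'."""
    regions = defaultdict(list)   # the result gets its own name
    for key in search_dict:       # iterate over the argument, not the empty result
        chromosome, span = key.split(":")
        start, end = (int(part) for part in span.split("-"))
        regions[chromosome].append((start, end))
    return regions


def database_snps_converter(database_dict):
    """Group variant start positions by chromosome."""
    variants = defaultdict(list)
    for key in database_dict:
        chromosome, span = key.split(":")
        variants[chromosome].append(int(span.split("-")[0]))
    return variants


# Keys taken from the question; the values are not used by the converters.
regions = search_region_converter(
    {'chr11:56694718-71838208': {}, 'chr13:27185654-39682032': {}}
)
variants = database_snps_converter(
    {'chr11:56694718-56694718': {}, 'chr13:27185659-27185659': {}}
)
print(dict(regions))
print(dict(variants))
```

With the converters reading from their arguments, the resulting defaultdicts can be passed to the matching generator in the same way as _search_regions and _database_variants above.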
Upvotes: 1