Reputation: 657
My input data is a list of dicts (matches), where each dict has 2 possible places for a record to show up, as well as a correlating factor between the two and their respective data sources:
[
{ 'r1': record_1, 'r2': record_2, 'corr': 85, 'r1_source': source_1, 'r2_source': source_2 },
{ 'r1': record_1, 'r2': record_3, 'corr': 90, 'r1_source': source_1, 'r2_source': source_3 },
{ 'r1': record_2, 'r2': record_3, 'corr': 77, 'r1_source': source_2, 'r2_source': source_3 },
...
]
Each record is represented by a list, which comes from a finite list of unique records.
The structure of my desired output data is a list of dicts where each unique record has itself, its source, and its average correlating factor:
[
{ 'record': record_1, 'source': source_1, 'avg': (85 + 90) / 2 },
{ 'record': record_2, 'source': source_2, 'avg': (85 + 77) / 2 },
{ 'record': record_3, 'source': source_3, 'avg': (90 + 77) / 2 },
]
My current solution:
def average_record_from_match_value(matches):
    averaged_recs = []
    for match in matches:
        # Q1
        if [rec for rec in averaged_recs if rec['record'] == match['r1']] == []:
            a_recs = []
            # Q2
            a_recs.extend([m['corr'] for m in matches if m['r1'] == match['r1']])
            a_recs.extend([m['corr'] for m in matches if m['r2'] == match['r1']])
            # Q3
            r1_value = sum(a_recs) / len(a_recs)
            averaged_recs.append({ 'record': match['r1'],
                                   'source': match['r1_source'],
                                   'match_value': r1_value,
                                   'record_value': r1_value})
        if [rec for rec in averaged_recs if rec['record'] == match['r2']] == []:
            b_recs = []
            b_recs.extend([m['corr'] for m in matches if m['r1'] == match['r2']])
            b_recs.extend([m['corr'] for m in matches if m['r2'] == match['r2']])
            r2_value = sum(b_recs) / len(b_recs)
            averaged_recs.append({ 'record': match['r2'],
                                   'source': match['r2_source'],
                                   'match_value': r2_value,
                                   'record_value': r2_value})
    return averaged_recs
This works, but I'm sure it can be improved. My questions, as labeled by the comments above, are:
[...] averaged_recs [...] list [...] for every match. [...] records [...] without looping over them twice like this?
Thanks for your help!
Upvotes: 0
Views: 67
Reputation: 10729
My idea: loop over the list once to build a single dict keyed by record, covering both r1 and r2. If the record appears as r1, insert the match at the head of that record's list; if it appears as r2, append it to the tail.
Then loop over this dict to build the output you expect.
from collections import defaultdict
test = [
{ 'r1': 'record_1', 'r2': 'record_2', 'corr': 85, 'r1_source': 'source_1', 'r2_source': 'source_2' },
{ 'r1': 'record_1', 'r2': 'record_3', 'corr': 90, 'r1_source': 'source_1', 'r2_source': 'source_3' },
{ 'r1': 'record_2', 'r2': 'record_3', 'corr': 77, 'r1_source': 'source_2', 'r2_source': 'source_3' },
]
temp = defaultdict(list)
for item in test:
    temp[item['r1']].insert(0, item)
    temp[item['r2']].append(item)
result = []
for key, value in temp.items():
    new_item = {}
    new_item['avg'] = sum(list(map(lambda item: item['corr'], value)))*1.0/len(value)
    new_item['record'] = key
    new_item['source'] = value[0]['r1_source'] if key == value[0]['r1'] else value[0]['r2_source']
    result.append(new_item)
print(result)
Output:
[{'avg': 87.5, 'record': 'record_1', 'source': 'source_1'}, {'avg': 81.0, 'record': 'record_2', 'source': 'source_2'}, {'avg': 83.5, 'record': 'record_3', 'source': 'source_3'}]
Update 1:
If r1 and r2 are lists, we can convert them to tuples (lists are not hashable, so they cannot be used as dict keys), then convert them back when calculating the output.
The code then looks like this:
from collections import defaultdict
record1 = [1, 2, 3]
record2 = [4, 5, 6]
record3 = [7, 8, 9]
test = [
{ 'r1': record1, 'r2': record2, 'corr': 85, 'r1_source': 'source_1', 'r2_source': 'source_2' },
{ 'r1': record1, 'r2': record3, 'corr': 90, 'r1_source': 'source_1', 'r2_source': 'source_3' },
{ 'r1': record2, 'r2': record3, 'corr': 77, 'r1_source': 'source_2', 'r2_source': 'source_3' },
]
temp = defaultdict(list)
for item in test:
    temp[tuple(item['r1'])].insert(0, item)
    temp[tuple(item['r2'])].append(item)
result = []
for key, value in temp.items():
    new_item = {}
    new_item['avg'] = sum(list(map(lambda item: item['corr'], value)))*1.0/len(value)
    new_item['record'] = list(key)
    new_item['source'] = value[0]['r1_source'] if list(key) == value[0]['r1'] else value[0]['r2_source']
    result.append(new_item)
print(result)
Output:
[{'avg': 87.5, 'record': [1, 2, 3], 'source': 'source_1'}, {'avg': 81.0, 'record': [4, 5, 6], 'source': 'source_2'}, {'avg': 83.5, 'record': [7, 8, 9], 'source': 'source_3'}]
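A variation on the approach above, as a minimal sketch rather than the answer's own code: instead of storing whole match dicts per record, it keeps a running sum and count, so the source can be captured the first time a record is seen and no head/tail ordering is needed. The function name average_by_record and the layout of the totals dict are illustrative assumptions. Records are assumed to be lists of hashable items, so they can be converted to tuples for use as dict keys.

```python
def average_by_record(matches):
    """Return one dict per unique record with its source and average 'corr'."""
    totals = {}  # tuple(record) -> {'source': ..., 'sum': ..., 'count': ...}
    for match in matches:
        for rec_key, src_key in (('r1', 'r1_source'), ('r2', 'r2_source')):
            key = tuple(match[rec_key])  # lists are unhashable, tuples are not
            if key not in totals:
                # First sighting of this record: remember its source.
                totals[key] = {'source': match[src_key], 'sum': 0, 'count': 0}
            totals[key]['sum'] += match['corr']
            totals[key]['count'] += 1
    return [
        {'record': list(key), 'source': t['source'], 'avg': t['sum'] / float(t['count'])}
        for key, t in totals.items()
    ]


record1, record2, record3 = [1, 2, 3], [4, 5, 6], [7, 8, 9]
test = [
    {'r1': record1, 'r2': record2, 'corr': 85, 'r1_source': 'source_1', 'r2_source': 'source_2'},
    {'r1': record1, 'r2': record3, 'corr': 90, 'r1_source': 'source_1', 'r2_source': 'source_3'},
    {'r1': record2, 'r2': record3, 'corr': 77, 'r1_source': 'source_2', 'r2_source': 'source_3'},
]
result = average_by_record(test)
print(result)
```

This visits each match once, so the work grows linearly with the number of matches.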
Upvotes: 1
Reputation: 4983
It's a little hard to do with a list comprehension, but I managed to write it in a few fewer lines, and hopefully with less clutter, using a temporary dict to group the values by record:
lst = [
{ 'r1': 'record_1', 'r2': 'record_2', 'corr': 85, 'r1_source': 'source_1', 'r2_source': 'source_2' },
{ 'r1': 'record_1', 'r2': 'record_3', 'corr': 90, 'r1_source': 'source_1', 'r2_source': 'source_3' },
{ 'r1': 'record_2', 'r2': 'record_3', 'corr': 77, 'r1_source': 'source_2', 'r2_source': 'source_3' },
]
tmp_dict = {}
for d in lst:
    # Create an entry the first time each record is seen
    if d['r1'] not in tmp_dict:
        tmp_dict[d['r1']] = {}
        tmp_dict[d['r1']]['corr'] = list()
        tmp_dict[d['r1']]['source'] = d['r1_source']
    if d['r2'] not in tmp_dict:
        tmp_dict[d['r2']] = {}
        tmp_dict[d['r2']]['corr'] = list()
        tmp_dict[d['r2']]['source'] = d['r2_source']
    # The same corr value counts toward both records of the match
    tmp_dict[d['r1']]['corr'].append(d['corr'])
    tmp_dict[d['r2']]['corr'].append(d['corr'])
print([{ 'record': k, 'source': tmp_dict[k]['source'], 'avg': sum(tmp_dict[k]['corr'])/float(len(tmp_dict[k]['corr'])) } for k in tmp_dict.keys()])
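The same grouping can be written with two flat dicts instead of a nested one. This is a sketch under the same assumption as the code above, namely that records are hashable (strings here); the names corrs, sources and averages are illustrative and not from the original answer.

```python
from collections import defaultdict

lst = [
    {'r1': 'record_1', 'r2': 'record_2', 'corr': 85, 'r1_source': 'source_1', 'r2_source': 'source_2'},
    {'r1': 'record_1', 'r2': 'record_3', 'corr': 90, 'r1_source': 'source_1', 'r2_source': 'source_3'},
    {'r1': 'record_2', 'r2': 'record_3', 'corr': 77, 'r1_source': 'source_2', 'r2_source': 'source_3'},
]

corrs = defaultdict(list)  # record -> every corr value it took part in
sources = {}               # record -> source, taken from the first sighting

for d in lst:
    corrs[d['r1']].append(d['corr'])
    corrs[d['r2']].append(d['corr'])
    # setdefault only stores the value if the key is not present yet
    sources.setdefault(d['r1'], d['r1_source'])
    sources.setdefault(d['r2'], d['r2_source'])

averages = [
    {'record': k, 'source': sources[k], 'avg': sum(v) / float(len(v))}
    for k, v in corrs.items()
]
print(averages)
```

defaultdict(list) removes the need for the two "not in" checks, because a missing key is created as an empty list on first access.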
Upvotes: 1
Reputation: 179
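Q1 - A dictionary's keys are inherently unique, so if you store the results in a dictionary keyed by record, I don't believe you need to recheck it this way. You're also iterating through averaged_recs, which is empty on the first pass.
Q2 - You could use or in the if condition to cover both cases in a single comprehension: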
[m['corr'] for m in matches if m['r1'] == match['r1'] or m['r2'] == match['r1']]
Q3 - I don't really think you need another way to do it.
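To make the Q1 and Q2 suggestions concrete, here is a hedged sketch of the original function with both applied. It assumes records are lists of hashable items; the seen set and the loop over the two sides are my own additions for illustration, not something the question or this answer prescribes.

```python
def average_record_from_match_value(matches):
    averaged_recs = []
    seen = set()  # Q1: tuple copies of records that were already averaged
    for match in matches:
        for side in ('r1', 'r2'):
            record = match[side]
            key = tuple(record)  # lists cannot go into a set, tuples can
            if key in seen:
                continue
            seen.add(key)
            # Q2: one comprehension with `or` instead of two extend() calls
            corrs = [m['corr'] for m in matches
                     if m['r1'] == record or m['r2'] == record]
            # Q3: the sum/len average is kept (float() guards against
            # integer division on Python 2)
            value = sum(corrs) / float(len(corrs))
            averaged_recs.append({'record': record,
                                  'source': match[side + '_source'],
                                  'match_value': value,
                                  'record_value': value})
    return averaged_recs


record1, record2, record3 = [1, 2, 3], [4, 5, 6], [7, 8, 9]
matches = [
    {'r1': record1, 'r2': record2, 'corr': 85, 'r1_source': 'source_1', 'r2_source': 'source_2'},
    {'r1': record1, 'r2': record3, 'corr': 90, 'r1_source': 'source_1', 'r2_source': 'source_3'},
    {'r1': record2, 'r2': record3, 'corr': 77, 'r1_source': 'source_2', 'r2_source': 'source_3'},
]
averaged = average_record_from_match_value(matches)
print(averaged)
```

One behavioral difference to be aware of: with or, a match whose r1 and r2 are the same record contributes its corr once, whereas the two separate extend() calls would count it twice. The inner comprehension still rescans all matches for each new record, so this tidies the code without changing its quadratic cost.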
Upvotes: 0