pythomatic

Reputation: 657

Extracting values from list of dicts to be used in calculation

My input data is a list of dicts (matches). Each dict holds two records (r1 and r2), the correlation factor between them (corr), and the data source of each record:

[
  { 'r1': record_1, 'r2': record_2, 'corr': 85, 'r1_source': source_1, 'r2_source': source_2 },
  { 'r1': record_1, 'r2': record_3, 'corr': 90, 'r1_source': source_1, 'r2_source': source_3 },
  { 'r1': record_2, 'r2': record_3, 'corr': 77, 'r1_source': source_2, 'r2_source': source_3 },
  ...
]

Each record is itself a list, drawn from a finite set of unique records.

My desired output is a list of dicts, one per unique record, containing the record, its source, and its average correlation factor:

[
  { 'record': record_1, 'source': source_1, 'avg': (85 + 90) / 2 },
  { 'record': record_2, 'source': source_2, 'avg': (85 + 77) / 2 },
  { 'record': record_3, 'source': source_3, 'avg': (90 + 77) / 2 },
]

My current solution:

def average_record_from_match_value(matches):
    averaged_recs = []

    for match in matches:
        # Q1 
        if [rec for rec in averaged_recs if rec['record'] == match['r1']] == []:
            a_recs = []
            # Q2
            a_recs.extend([m['corr'] for m in matches if m['r1'] == match['r1']])
            a_recs.extend([m['corr'] for m in matches if m['r2'] == match['r1']])
            # Q3
            r1_value = sum(a_recs) / len(a_recs)
            averaged_recs.append({ 'record': match['r1'],
                                   'source': match['r1_source'],
                                   'match_value': r1_value,
                                   'record_value': r1_value})
        if [rec for rec in averaged_recs if rec['record'] == match['r2']] == []:
            b_recs = []
            b_recs.extend([m['corr'] for m in matches if m['r1'] == match['r2']])
            b_recs.extend([m['corr'] for m in matches if m['r2'] == match['r2']])
            r2_value = sum(b_recs) / len(b_recs)
            averaged_recs.append({ 'record': match['r2'],
                                   'source': match['r2_source'],
                                   'match_value': r2_value,
                                   'record_value': r2_value})

    return averaged_recs

This works, but I'm sure it can be improved. My questions as labeled by the comments above are:

  1. Is there a better way to enforce uniqueness here? I have a gut feeling that I don't need to be traversing my averaged_recs list for every match.
  2. Can I corral all of these records without looping over them twice like this?
  3. Can/should this average calculation be combined with the previous list extension?

Thanks for your help!

Upvotes: 0

Views: 67

Answers (3)

Sphinx

Reputation: 10729

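My idea: loop over the list once to build a single dict keyed by record, covering both r1 and r2. When a record appears as r1, insert the match at the head of that record's list; when it appears as r2, append the match to the tail.

Then loop over this dict to build the output you expect.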

from collections import defaultdict
test = [
  { 'r1': 'record_1', 'r2': 'record_2', 'corr': 85, 'r1_source': 'source_1', 'r2_source': 'source_2' },
  { 'r1': 'record_1', 'r2': 'record_3', 'corr': 90, 'r1_source': 'source_1', 'r2_source': 'source_3' },
  { 'r1': 'record_2', 'r2': 'record_3', 'corr': 77, 'r1_source': 'source_2', 'r2_source': 'source_3' },
]
temp = defaultdict(list)
for item in test:
    temp[item['r1']].insert(0, item)
    temp[item['r2']].append(item)

result = []
for key, value in temp.items():
    new_item = {}
    new_item['avg'] = sum(list(map(lambda item: item['corr'], value)))*1.0/len(value)
    new_item['record'] = key
    new_item['source'] = value[0]['r1_source'] if key == value[0]['r1'] else value[0]['r2_source']
    result.append(new_item)
print(result)

Output:

[{'avg': 87.5, 'record': 'record_1', 'source': 'source_1'}, {'avg': 81.0, 'record': 'record_2', 'source': 'source_2'}, {'avg': 83.5, 'record': 'record_3', 'source': 'source_3'}]

Update 1:

If r1 and r2 are lists, we can convert them to tuples so they can be used as dict keys, then convert them back to lists when building the output.

The code then looks like this:

from collections import defaultdict
record1 = [1, 2, 3]
record2 = [4, 5, 6]
record3 = [7, 8, 9]
test = [
  { 'r1': record1, 'r2': record2, 'corr': 85, 'r1_source': 'source_1', 'r2_source': 'source_2' },
  { 'r1': record1, 'r2': record3, 'corr': 90, 'r1_source': 'source_1', 'r2_source': 'source_3' },
  { 'r1': record2, 'r2': record3, 'corr': 77, 'r1_source': 'source_2', 'r2_source': 'source_3' },
]
temp = defaultdict(list)
for item in test:
    temp[tuple(item['r1'])].insert(0, item)
    temp[tuple(item['r2'])].append(item)

result = []
for key, value in temp.items():
    new_item = {}
    new_item['avg'] = sum(list(map(lambda item: item['corr'], value)))*1.0/len(value)
    new_item['record'] = list(key)
    new_item['source'] = value[0]['r1_source'] if list(key) == value[0]['r1'] else value[0]['r2_source']
    result.append(new_item)
print(result)

Output:

[{'avg': 87.5, 'record': [1, 2, 3], 'source': 'source_1'}, {'avg': 81.0, 'record': [4, 5, 6], 'source': 'source_2'}, {'avg': 83.5, 'record': [7, 8, 9], 'source': 'source_3'}]
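The tuple conversion in Update 1 is needed because a list cannot be used as a dict key. A minimal sketch of why, assuming the elements of each record are themselves hashable (numbers or strings, not nested lists); the variable names here are only illustrative:

```python
record = [1, 2, 3]

# Lists are mutable and therefore unhashable, so using one as a
# dict key raises TypeError.
try:
    lookup = {record: 'source_1'}
    list_key_worked = True
except TypeError:
    list_key_worked = False

# A tuple with the same elements is hashable and works as a key.
lookup = {tuple(record): 'source_1'}

# Converting the key back gives a list equal to the original record.
restored = list(next(iter(lookup)))
```

Note that the restored list is equal to the original record but is a new object, which only matters if other code relies on identity rather than equality.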

Upvotes: 1

shahaf

Reputation: 4983

It's a little hard to do with a list comprehension alone, but I managed to write it in fewer lines, and hopefully with less clutter, using a temporary dict to group the values by record:

lst = [
  { 'r1': 'record_1', 'r2': 'record_2', 'corr': 85, 'r1_source': 'source_1', 'r2_source': 'source_2' },
  { 'r1': 'record_1', 'r2': 'record_3', 'corr': 90, 'r1_source': 'source_1', 'r2_source': 'source_3' },
  { 'r1': 'record_2', 'r2': 'record_3', 'corr': 77, 'r1_source': 'source_2', 'r2_source': 'source_3' },
]

# Group the correlation values per record and remember each record's source.
tmp_dict = {}
for d in lst:
    if d['r1'] not in tmp_dict:
        tmp_dict[d['r1']] = {'corr': [], 'source': d['r1_source']}

    if d['r2'] not in tmp_dict:
        tmp_dict[d['r2']] = {'corr': [], 'source': d['r2_source']}

    tmp_dict[d['r1']]['corr'].append(d['corr'])
    tmp_dict[d['r2']]['corr'].append(d['corr'])

# print(...) with parentheses runs on both Python 2 and Python 3.
print([{'record': k, 'source': v['source'], 'avg': sum(v['corr']) / float(len(v['corr']))}
       for k, v in tmp_dict.items()])

Upvotes: 1

Mateus Terra

Reputation: 179

Q1 - A dictionary's keys are inherently unique, so if you key your results by record I don't believe you need to recheck it this way. You're also iterating through averaged_recs, which is empty on the first pass.

Q2 - You could use or in the if condition of a single comprehension: [m['corr'] for m in matches if m['r1'] == match['r1'] or m['r2'] == match['r1']]

Q3 - I don't really think you need another way to do it
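As a rough sketch of how the Q1 and Q2 suggestions could fit together (this is not the asker's code): a dict keyed by record takes care of uniqueness, and a single comprehension with or collects the correlation values. The function name average_by_record and the tuple conversion for list records are assumptions made for illustration.

```python
def average_by_record(matches):
    """Average 'corr' per unique record, using dict keys for uniqueness."""
    averaged = {}  # key: record (tuple if it is a list), value: output dict
    for match in matches:
        for rec_key, src_key in (('r1', 'r1_source'), ('r2', 'r2_source')):
            record = match[rec_key]
            # Lists are unhashable, so use a tuple as the dict key.
            key = tuple(record) if isinstance(record, list) else record
            if key in averaged:
                continue  # already handled this record
            # Q2: one comprehension with `or` instead of two extend() calls.
            corrs = [m['corr'] for m in matches
                     if m['r1'] == record or m['r2'] == record]
            averaged[key] = {
                'record': record,
                'source': match[src_key],
                'avg': sum(corrs) / float(len(corrs)),
            }
    return list(averaged.values())


matches = [
    {'r1': 'record_1', 'r2': 'record_2', 'corr': 85, 'r1_source': 'source_1', 'r2_source': 'source_2'},
    {'r1': 'record_1', 'r2': 'record_3', 'corr': 90, 'r1_source': 'source_1', 'r2_source': 'source_3'},
    {'r1': 'record_2', 'r2': 'record_3', 'corr': 77, 'r1_source': 'source_2', 'r2_source': 'source_3'},
]
result = average_by_record(matches)
```

This still rescans matches once per unique record, so it removes the scan of averaged_recs but is not a single-pass solution; the dict-grouping approaches in the other answers avoid the rescan.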

Upvotes: 0
