Looping through multiple dictionaries to create new dictionary from values in python

Question

I have two lists of dictionaries. Both contain an item of data as well as a start and stop timestamp. The first list contains dictionaries representing observations of sequences of text with a start and stop time. It looks like this:

list_1 = [
      {'word': 'hey hows it going?', 's1': 1.2, 's2': 3.6},
      {'word': 'um', 's1': 3.7, 's2': 4.2},
      {'word': 'its raining outside today', 's1': 4.3, 's2': 5.0},
      {'word': 'and its really cold', 's1': 5.1, 's2': 6.6},
      {'word': 'dont you think?', 's1': 6.7, 's2': 8.1},
      {'word': 'its awful', 's1': 7.7, 's2': 9.0}
    ]

The second list contains dictionaries representing observations of categories with a start and stop time. It looks like this:

list_2 = [
  {'category': 0, 's1': 0.0, 's2': 3.8},
  {'category': 1, 's1': 3.9, 's2': 4.9},
  {'category': 1, 's1': 5.0, 's2': 7.2},
  {'category': 0, 's1': 7.3, 's2': 7.6},
  {'category': 1, 's1': 7.7, 's2': 9.0}
]

I want to create a new item in the dictionaries for list_2 using list_1['word'] values according to the following logic:

If a value from list_1['s1'] is greater than a value from list_2['s1'] AND less than a value from list_2['s2'], append all values from list_1['word'] into the new item, list_2['words'].
If a value from list_1['s1'] is greater than a value from list_2['s1'] AND less than a value from list_2['s2'], BUT list_1['s2'] is greater than that value from list_2['s2'], append all values from list_1['word'] into the new item, list_2['words'], for the NEXT dictionary.

Another way to think about it is as you're looping through list_1 and list_2:

if the timestamps from a list_1 item fall within the timestamps for a list_2 item, add the list_1 words to a new key-value pair in that list_2 item.
if the timestamps from a list_1 item do not fall within the timestamps for a single list_2 item, for example the words start in list_2[0] but end in list_2[1], then add the list_1['word'] value from list_1[0] to list_2[1].

It should look like this:

expected_output =[
   {'category': 0,
      's1': 0.0,
      's2': 3.8,
      'words': 'hey hows it going? um'},
   {'category': 1,
      's1': 3.9,
      's2': 4.9,
      'words': 'its raining outside today'},
   {'category': 1,
      's1': 5.0,
      's2': 7.2,
      'words': 'and its really cold'},
   {'category': 0,
      's1': 7.3,
      's2': 9.0,
      'words': 'dont you think? its awful'}
  ]

Cireo · Accepted Answer

Your original algorithm says "the NEXT" one. Are you sure that is what you want? I tried to implement what you described, but it isn't clear what should happen when a phrase overlaps more than 2 speakers.

A few design notes:

your data would make more sense if the boundaries were half-open, [a, b), instead of closed, [a, b]: where should 3.65 go?
it might be more reusable to store the values as a list (or to determine the order of insertion by start time) instead of flattening them to a string with spaces here. You can always flatten them afterwards.
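To illustrate the two notes above, here is a minimal sketch, not part of the solution that follows. It assumes a hypothetical half-open version of the first three ranges in list_2, where each s2 equals the next s1, so a time such as 3.85 (which falls in the gap between 3.8 and 3.9 in the original data) belongs to exactly one range. The name speaker_at and the adjusted boundaries are made up for illustration.

```python
# Hypothetical half-open ranges [s1, s2): each s2 equals the next s1,
# so there are no gaps and no time belongs to two ranges.
speakers = [
    {'category': 0, 's1': 0.0, 's2': 3.9},
    {'category': 1, 's1': 3.9, 's2': 5.0},
    {'category': 1, 's1': 5.0, 's2': 7.3},
]

def speaker_at(t, ranges):
    """Return the single range with s1 <= t < s2, or raise LookupError."""
    for r in ranges:
        if r['s1'] <= t < r['s2']:
            return r
    raise LookupError('no speaker at %s' % t)

# Keep the words as a list; join them only when a string is needed.
for s in speakers:
    s['words'] = []
speaker_at(3.85, speakers)['words'].append('um')
```

A boundary value such as 3.9 then resolves to the range that starts there, not the one that ends there.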

START, END = 's1', 's2'

def require_speaker(start, end):
    ''' Return the speaker whose range contains start..end, or else the
    speaker right after the first one the phrase overlaps '''
    # This should be an interval tree if your data is large
    # https://en.wikipedia.org/wiki/Interval_tree

    # Exactly one of the first 3 is true, so we could use an `else`,
    # listing all for clarity.
    after = lambda v: v[END] < start
    overlaps = lambda v: start <= v[END] and v[START] <= end
    before = lambda v: end < v[START]
    contained = lambda v: v[START] <= start and end <= v[END]

    take_next = False
    for speaker in list_2:
        if take_next:
            return speaker
        if after(speaker):
            continue
        elif contained(speaker):
            return speaker
        elif overlaps(speaker):
            take_next = True
        elif before(speaker):
            break  # Missed it somehow (can't happen if full coverage)
    raise LookupError('no speaker in range %s - %s' % (start, end))

# Prepare a list for phrases
for speakers in list_2:
    speakers['words'] = []
# Populate phrases for each speaker
for phrase in list_1:
    speaker = require_speaker(phrase[START], phrase[END])
    speaker['words'].append(phrase['word'])
# Convert back to string
for speakers in list_2:
    speakers['words'] = ' '.join(speakers['words'])
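The comment in require_speaker suggests an interval tree for large data. If the speaker ranges are sorted and do not overlap each other (an assumption that holds for the sample data), a binary search on the end times is a lighter alternative. The sketch below is not an interval tree. It uses the standard bisect module to apply the same "contained, else next" rule, and it returns the words as lists instead of modifying list_2. The name assign_phrases is hypothetical.

```python
import bisect

def assign_phrases(phrases, speakers):
    """Return one list of words per speaker ('contained, else next' rule).

    Assumes speakers are sorted by time and do not overlap each other.
    """
    ends = [s['s2'] for s in speakers]
    words = [[] for _ in speakers]
    for p in phrases:
        # First speaker whose range does not end before the phrase starts
        i = bisect.bisect_left(ends, p['s1'])
        if i == len(speakers) or p['s2'] < speakers[i]['s1']:
            raise LookupError('no speaker in range %s - %s' % (p['s1'], p['s2']))
        contained = speakers[i]['s1'] <= p['s1'] and p['s2'] <= speakers[i]['s2']
        if not contained:
            i += 1  # overlaps but is not contained: use the NEXT speaker
            if i == len(speakers):
                raise LookupError('no next speaker for %r' % p['word'])
        words[i].append(p['word'])
    return words

list_1 = [
    {'word': 'hey hows it going?', 's1': 1.2, 's2': 3.6},
    {'word': 'um', 's1': 3.7, 's2': 4.2},
    {'word': 'its raining outside today', 's1': 4.3, 's2': 5.0},
    {'word': 'and its really cold', 's1': 5.1, 's2': 6.6},
    {'word': 'dont you think?', 's1': 6.7, 's2': 8.1},
    {'word': 'its awful', 's1': 7.7, 's2': 9.0},
]
list_2 = [
    {'category': 0, 's1': 0.0, 's2': 3.8},
    {'category': 1, 's1': 3.9, 's2': 4.9},
    {'category': 1, 's1': 5.0, 's2': 7.2},
    {'category': 0, 's1': 7.3, 's2': 7.6},
    {'category': 1, 's1': 7.7, 's2': 9.0},
]
result = assign_phrases(list_1, list_2)
```

Each lookup is a binary search instead of a scan over every speaker, and joining each inner list with spaces gives the same strings as the loop-based version.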

With your data

list_1 = [
      {'word': 'hey hows it going?', 's1': 1.2, 's2': 3.6},
      {'word': 'um', 's1': 3.7, 's2': 4.2},
      {'word': 'its raining outside today', 's1': 4.3, 's2': 5.0},
      {'word': 'and its really cold', 's1': 5.1, 's2': 6.6},
      {'word': 'dont you think?', 's1': 6.7, 's2': 8.1},
      {'word': 'its awful', 's1': 7.7, 's2': 9.0}
    ]

list_2 = [
  {'category': 0, 's1': 0.0, 's2': 3.8},
  {'category': 1, 's1': 3.9, 's2': 4.9},
  {'category': 1, 's1': 5.0, 's2': 7.2},
  {'category': 0, 's1': 7.3, 's2': 7.6},
  {'category': 1, 's1': 7.7, 's2': 9.0}
]

You get

>>> import pprint
>>> pprint.pprint(list_2)
[{'category': 0, 's1': 0.0, 's2': 3.8, 'words': 'hey hows it going?'},
 {'category': 1, 's1': 3.9, 's2': 4.9, 'words': 'um'},
 {'category': 1,
  's1': 5.0,
  's2': 7.2,
  'words': 'its raining outside today and its really cold'},
 {'category': 0, 's1': 7.3, 's2': 7.6, 'words': 'dont you think?'},
 {'category': 1, 's1': 7.7, 's2': 9.0, 'words': 'its awful'}]

Note that your expected output doesn't match your algorithm:

"um" (3.7-4.2) should be placed in the 3.9-4.9 range
"its raining outside today" (4.3-5.0) should be placed in the 5.0-7.2 range
"dont you think?" (6.7-8.1) should be placed in the 7.3-7.6 range
