I have two lists of dictionaries. Both contain an item of data as well as a start and stop timestamp. The first list contains dictionaries representing observations of sequences of text with a start and stop time. It looks like this:
list_1 = [
{'word': 'hey hows it going?', 's1': 1.2, 's2': 3.6},
{'word': 'um', 's1': 3.7, 's2': 4.2},
{'word': 'its raining outside today', 's1': 4.3, 's2': 5.0},
{'word': 'and its really cold', 's1': 5.1, 's2': 6.6},
{'word': 'dont you think?', 's1': 6.7, 's2': 8.1},
{'word': 'its awful', 's1': 7.7, 's2': 9.0}
]
The second list contains dictionaries representing observations of categories with a start and stop time. It looks like this:
list_2 = [
{'category': 0, 's1': 0.0, 's2': 3.8},
{'category': 1, 's1': 3.9, 's2': 4.9},
{'category': 1, 's1': 5.0, 's2': 7.2},
{'category': 0, 's1': 7.3, 's2': 7.6},
{'category': 1, 's1': 7.7, 's2': 9.0}
]
I want to create a new item in the dictionaries for list_2 using list_1['word'] values according to the following logic:
If a value from list_1['s1'] is greater than a value from list_2['s1'] AND less than a value from list_2['s2'], append all values from list_1['word'] into the new item, list_2['word'].
If a value from list_1['s1'] is greater than a value from list_2['s1'] AND less than a value from list_2['s2'], BUT list_1['s2'] is greater than a value from list_2['s1'], append all values from list_1['word'] into the new item, list_2['word'] for the NEXT dictionary.
Another way to think about it, as you're looping through list_1 and list_2:
If the timestamps from a list_1 item fall within the timestamps for a list_2 item, add the list_1 words to a new key-value pair in list_2.
If the timestamps from a list_1 item do not fall within the timestamps for a single list_2 item, for example the words 'start' in list_2[0] but 'end' in list_2[1], then add list_1['word'] from list_1[0] to list_2[1].
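To make the two cases concrete, here is a small sketch that classifies one phrase against one list_2 interval. It only reflects one reading of the rules above: `classify` is a hypothetical helper name, and treating the boundaries as inclusive is an assumption the question does not settle.

```python
def classify(phrase, interval):
    """Classify a phrase against one interval (hypothetical helper).

    Assumes inclusive boundaries, which the question does not specify.
    """
    starts_inside = interval['s1'] <= phrase['s1'] <= interval['s2']
    ends_inside = interval['s1'] <= phrase['s2'] <= interval['s2']
    if starts_inside and ends_inside:
        return 'inside'      # first case: words go to this interval
    if starts_inside:
        return 'straddles'   # second case: words go to the NEXT interval
    return 'outside'         # phrase does not start in this interval


inside = classify({'word': 'hey hows it going?', 's1': 1.2, 's2': 3.6},
                  {'category': 0, 's1': 0.0, 's2': 3.8})
straddles = classify({'word': 'dont you think?', 's1': 6.7, 's2': 8.1},
                     {'category': 1, 's1': 5.0, 's2': 7.2})
print(inside, straddles)  # inside straddles
```

The first phrase sits entirely within 0.0 to 3.8, while the second starts at 6.7 (inside 5.0 to 7.2) and ends at 8.1 (past 7.2).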
It should look like this:
expected_output =[
{'category': 0,
's1': 0.0,
's2': 3.8,
'words': 'hey hows it going? um'},
{'category': 1,
's1': 3.9,
's2': 4.9,
'words': 'its raining outside today'},
{'category': 1,
's1': 5.0,
's2': 7.2,
'words': 'and its really cold'},
{'category': 0,
's1': 7.3,
's2': 9.0,
'words': 'dont you think? its awful'}
]
Upvotes: 0
Views: 82
Reputation: 4427
Your original algorithm says "the NEXT" one; are you sure that is what you want? I tried to implement what you described, but it isn't clear what should happen when a phrase overlaps more than 2 speakers.
A few design notes:
Consider half-open intervals [a, b) instead of closed intervals [a, b] - where should 3.65 go?

START, END = 's1', 's2'

def require_speaker(start, end):
    ''' Return the latest speaker in start <= time <= end '''
    # This should be an interval tree if your data is large
    # https://en.wikipedia.org/wiki/Interval_tree

    # Exactly one of the first 3 is true, so we could use an `else`,
    # listing all for clarity.
    after = lambda v: v[END] < start      # phrase starts after the speaker ends
    overlaps = lambda v: start <= v[END] and v[START] <= end
    before = lambda v: end < v[START]     # phrase ends before the speaker starts
    contained = lambda v: v[START] <= start and end <= v[END]

    take_next = False
    for speaker in list_2:
        if take_next:
            return speaker
        if after(speaker):
            continue
        elif contained(speaker):
            return speaker
        elif overlaps(speaker):
            take_next = True
        elif before(speaker):
            break  # Missed it somehow (can't happen if full coverage)
    raise LookupError('no speaker in range %s - %s' % (start, end))

# Prepare a list for phrases
for speakers in list_2:
    speakers['words'] = []

# Populate phrases for each speaker
for phrase in list_1:
    speaker = require_speaker(phrase[START], phrase[END])
    speaker['words'].append(phrase['word'])

# Convert back to string
for speakers in list_2:
    speakers['words'] = ' '.join(speakers['words'])
With your data
list_1 = [
{'word': 'hey hows it going?', 's1': 1.2, 's2': 3.6},
{'word': 'um', 's1': 3.7, 's2': 4.2},
{'word': 'its raining outside today', 's1': 4.3, 's2': 5.0},
{'word': 'and its really cold', 's1': 5.1, 's2': 6.6},
{'word': 'dont you think?', 's1': 6.7, 's2': 8.1},
{'word': 'its awful', 's1': 7.7, 's2': 9.0}
]
list_2 = [
{'category': 0, 's1': 0.0, 's2': 3.8},
{'category': 1, 's1': 3.9, 's2': 4.9},
{'category': 1, 's1': 5.0, 's2': 7.2},
{'category': 0, 's1': 7.3, 's2': 7.6},
{'category': 1, 's1': 7.7, 's2': 9.0}
]
You get
>>> import pprint
>>> pprint.pprint(list_2)
[{'category': 0, 's1': 0.0, 's2': 3.8, 'words': 'hey hows it going?'},
 {'category': 1, 's1': 3.9, 's2': 4.9, 'words': 'um'},
 {'category': 1,
  's1': 5.0,
  's2': 7.2,
  'words': 'its raining outside today and its really cold'},
 {'category': 0, 's1': 7.3, 's2': 7.6, 'words': 'dont you think?'},
 {'category': 1, 's1': 7.7, 's2': 9.0, 'words': 'its awful'}]
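On the interval-tree comment in require_speaker: if the intervals in list_2 are sorted by start time and never overlap (true for the sample data, but an assumption in general), a binary search over the start times is a lighter option for point lookups, such as finding the interval that contains a phrase's start time. The following is only a sketch; `find_interval` is a hypothetical helper and not part of the code above. It also illustrates the design note about closed intervals: a time that falls between two intervals matches nothing.

```python
from bisect import bisect_right

# Same intervals as list_2; assumed sorted by 's1' and non-overlapping.
intervals = [
    {'category': 0, 's1': 0.0, 's2': 3.8},
    {'category': 1, 's1': 3.9, 's2': 4.9},
    {'category': 1, 's1': 5.0, 's2': 7.2},
    {'category': 0, 's1': 7.3, 's2': 7.6},
    {'category': 1, 's1': 7.7, 's2': 9.0},
]
starts = [iv['s1'] for iv in intervals]


def find_interval(t):
    """Return the interval with s1 <= t <= s2, or None if t is in a gap."""
    # Index of the last interval whose start is <= t (or -1 if none).
    i = bisect_right(starts, t) - 1
    if i >= 0 and t <= intervals[i]['s2']:
        return intervals[i]
    return None


hit = find_interval(4.3)    # falls inside 3.9 to 4.9
gap = find_interval(3.85)   # falls between 3.8 and 3.9
print(hit['s1'], gap)       # 3.9 None
```

Each lookup is O(log n) instead of a scan over all speakers. With half-open intervals that share boundaries there would be no gaps, so the `None` case would only arise before the first or after the last interval.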
Note that your expected output doesn't match your algorithm:
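One such mismatch can be checked directly. This is a sketch under my reading of your second rule (a phrase that starts inside an interval but ends after it goes to the NEXT dictionary); the variable names are only for illustration.

```python
um = {'word': 'um', 's1': 3.7, 's2': 4.2}
first = {'category': 0, 's1': 0.0, 's2': 3.8}
expected_first_words = 'hey hows it going? um'  # from your expected_output[0]

# 'um' starts inside the first interval but ends after it.
starts_inside = first['s1'] < um['s1'] < first['s2']
ends_inside = um['s2'] <= first['s2']
spills_over = starts_inside and not ends_inside

# Yet the expected output lists 'um' under the first interval.
in_expected_first = um['word'] in expected_first_words.split()

print(spills_over, in_expected_first)  # True True
```

Under that reading 'um' belongs with the second dictionary (3.9 to 4.9), but your expected output keeps it in the first one.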