Reputation: 2485
I have a text file that follows this format.
LESTER HOLT (00:00:01): Breaking News Tonight: A deadly mass shooting at the airport. A gunman opens fire at baggage claim in Fort Lauderdale, witnesses describing scenes of sheer horror. A silent killer shooting people in the head as they tried to run and hide. Tonight, a storm of questions. Why did he do it? The suspect, a passenger with a firearm in his checked bag. New concerns about airport security before the checkpoint.
(00:00:25): Also breaking tonight the new report from U.S. intelligence: Vladimir Putin himself ordered the effort to influence the election, aimed at hurting Clinton and helping Trump win. What the President-elect is saying after his top-secret briefing.
(00:00:39): And States of Emergency: Millions from coast to coast paralyzed by a massive winter storm.
(00:00:45): NIGHTLY NEWS begins right now.
I am trying to parse this into a Python dictionary of dictionaries: each speaker maps to a dictionary whose keys are timecodes and whose values are the content. I can't split consistently, because there may be text before the timecode (e.g. the speaker name on the first line), and because the split character : also appears inside the timecode itself (00:00:00).
I tried splitting each line on a regex:
import re
from time import sleep

for line in msg.get_payload().split('\n'):
    regex = r'\d{2}:\d{2}:\d{2}'
    test = re.split(regex, line)
    print(test)
    sleep(1)
This appears to split the line properly, but I lose the value I am splitting on (the timecode), which I intend to use as a key. How can I split the content above so that I get the speaker, then the timecode as a key and the content as a value? The same speaker may appear again later in the text, in which case the new timecodes should be added to that speaker's existing entries.
The output format I am targeting is something along the lines of
{speakers: {'Lester Holt': {'00:00:01': content..., '00:00:25': content...},
            'speaker2': {etc, etc, etc}}}
I've tried using the split as mentioned above, but it removes the timecode.
Any thoughts and guidance are appreciated.
Upvotes: 0
Views: 1964
Reputation: 7840
Don't bother with split. You're trying to get 2-3 pieces of information out of each line, so try the following:
for line in msg.get_payload().split('\n'):
    match = re.search(r'^\s*([^(]*?)\s*\((\d{2}:\d{2}:\d{2})\):\s*(.*)$', line)
    if match:
        (speaker, time, message) = match.groups()
Speaker will be an empty string if none was present on that line.
Regex explanation:
^                      # Start of line
\s*                    # Drop leading whitespace
([^(]*?)               # Capture the speaker if present (non-paren characters)
\s*                    # Drop whitespace between name and time
\(                     # Drop literal open paren
(\d{2}:\d{2}:\d{2})    # Capture time
\):\s*                 # Drop literal close paren, colon and whitespace
(.*)                   # Capture the rest of the line
$                      # End of line
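To get from those three captured values to the nested dictionary the question asks for, one option is to remember the last speaker seen and file each timecode under it. This is only a sketch under a few assumptions: the function name parse_transcript and the sample string are made up for illustration, speaker names are title-cased to match the target output, and any timecoded line that appears before a speaker has been named ends up under the key None.

```python
import re
from collections import defaultdict

# Same pattern as above, compiled once: speaker (may be empty), timecode, message.
LINE_RE = re.compile(r'^\s*([^(]*?)\s*\((\d{2}:\d{2}:\d{2})\):\s*(.*)$')

def parse_transcript(text):
    """Return {speaker: {timecode: content}} for a transcript string."""
    speakers = defaultdict(dict)
    current = None  # last speaker seen; lines without a name inherit it
    for line in text.split('\n'):
        match = LINE_RE.search(line)
        if not match:
            continue  # skip blank lines and lines with no timecode
        speaker, time, message = match.groups()
        if speaker:
            current = speaker.title()  # 'LESTER HOLT' -> 'Lester Holt'
        speakers[current][time] = message
    return dict(speakers)

# Hypothetical, shortened sample in the question's format.
sample = (
    "LESTER HOLT (00:00:01): Breaking News Tonight.\n"
    "(00:00:25): Also breaking tonight.\n"
    "(00:00:45): NIGHTLY NEWS begins right now.\n"
)
result = parse_transcript(sample)
print(result)
```

Because the dictionary is keyed by speaker, a speaker who comes back later in the text simply gets more timecodes added to their existing entry.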
Upvotes: 2
Reputation: 1506
Splitting the message into lines when you need to split it into time-stamped paragraphs is a waste. re.split can keep the tokens it splits on: per the documentation, if the pattern contains capturing parentheses, the text of those groups is also returned in the resulting list. Here's my solution:
toks = re.split(r"\((\d\d:\d\d:\d\d)\):", msg.get_payload())[1:]
answer = dict(zip(toks[::2], toks[1::2]))
This creates a dictionary of timestamps and paragraphs. Feel free to use the same approach to split by speaker as well.
Result: { '00:00:01': ' Breaking News Tonight: A .....', '00:00:25': ' Also breaking tonight ......', .... }
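As a rough sketch of what splitting by speaker could look like with the same trick: add a second capturing group for the optional name, so re.split returns speaker, timestamp and paragraph in groups of three. The names STAMP_RE and split_by_speaker, the sample text, and the assumption that a speaker name sits at the start of a line and contains no parentheses are all mine, not part of the original question.

```python
import re

# Two capturing groups, so re.split keeps both the (possibly empty) speaker
# and the timestamp. re.MULTILINE lets ^ match at the start of every line.
STAMP_RE = re.compile(r"^[ \t]*([^()\n]*?)[ \t]*\((\d\d:\d\d:\d\d)\):", re.MULTILINE)

def split_by_speaker(text):
    """Return {speaker: {timestamp: paragraph}} using a single re.split call."""
    # Drop the text before the first timestamp, as in the snippet above.
    toks = STAMP_RE.split(text)[1:]
    result = {}
    current = None  # paragraphs with no name belong to the last named speaker
    for speaker, stamp, paragraph in zip(toks[::3], toks[1::3], toks[2::3]):
        if speaker:
            current = speaker
        result.setdefault(current, {})[stamp] = paragraph.strip()
    return result

# Hypothetical sample with a paragraph that spans two lines.
sample = (
    "LESTER HOLT (00:00:01): Breaking News Tonight.\n"
    "Why did he do it?\n"
    "(00:00:25): Also breaking tonight.\n"
)
parsed = split_by_speaker(sample)
print(parsed)
```

Unlike a line-by-line loop, this keeps a paragraph that wraps over several lines together under one timestamp.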
Upvotes: 2