curious_cosmo

Reputation: 1214

Splitting a string of text based on keywords in Python

I have a string of text like this:

'tx cycle up.... down
rx cycle up.... down
phase:...
rx on scan: 123456
tx cycle up.... down
rx cycle up.... down
phase:...
rx on scan: 789012
setup
tx cycle up.... down
rx cycle up.... down
tx cycle up.... down
rx cycle up.... down'

I need to split this string into a list of strings, one per chunk, like this:

['tx cycle up.... down rx cycle up.... down phase:.... rx on scan: 123456', 
 'tx cycle up.... down rx cycle up.... down phase:.... rx on scan: 789012',
 'tx cycle up... down rx cycle up.... down',
 'tx cycle up... down rx cycle up.... down']

Sometimes the chunks have a 'phase' line and a 'scan' number, and sometimes they do not. The solution needs to be general enough to handle either case, because I have to apply it to a lot of data.

Basically, I want to split it into a list of strings where each element extends from an occurrence of 'tx' to the next 'tx' (including the first 'tx' but not the next one in that element). How can I do this?

Edit: Suppose that in addition to the string of text above I have other strings of text that appear like this:

'closeloop start
closeloop ..up:677 down:098
closeloop start
closeloop ..up:568 down:123'

My code goes through each string of text and splits it into a list with the splitting code. When it reaches this string, though, it finds nothing to split on. How can I split at the 'closeloop start' lines when they appear, and at the 'tx' lines as before when those appear? I tried this code but got a TypeError:

data = re.split(r'\n((?=tx)|(?=closeloop\sstart))', data)

Upvotes: 2

Views: 3150

Answers (1)

Martijn Pieters

Reputation: 1121216

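You can split on newlines that are followed by tx. The (?=tx) look-ahead only checks that tx comes next without consuming it, so every chunk keeps its leading tx: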

import re

re.split(r'\n(?=tx)', inputtext)

Demo:

>>> import re
>>> inputtext = '''tx cycle up.... down
... rx cycle up.... down
... phase:...
... rx on scan: 123456
... tx cycle up.... down
... rx cycle up.... down
... phase:...
... rx on scan: 789012
... setup
... tx cycle up.... down
... rx cycle up.... down
... tx cycle up.... down
... rx cycle up.... down'''
>>> re.split(r'\n(?=tx)', inputtext)
['tx cycle up.... down\nrx cycle up.... down\nphase:...\nrx on scan: 123456', 'tx cycle up.... down\nrx cycle up.... down\nphase:...\nrx on scan: 789012\nsetup', 'tx cycle up.... down\nrx cycle up.... down', 'tx cycle up.... down\nrx cycle up.... down']
>>> from pprint import pprint
>>> pprint(_)
['tx cycle up.... down\nrx cycle up.... down\nphase:...\nrx on scan: 123456',
 'tx cycle up.... down\nrx cycle up.... down\nphase:...\nrx on scan: 789012\nsetup',
 'tx cycle up.... down\nrx cycle up.... down',
 'tx cycle up.... down\nrx cycle up.... down']
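If you want each chunk on a single line, as in the list shown in the question, one option is to join the lines of each chunk with a space after splitting. This is only a sketch on a shortened copy of the sample input; the name chunks is made up for illustration, and a line such as setup would stay in its chunk unless you filter it out yourself:

```python
import re

# Shortened copy of the sample input from the question.
inputtext = """tx cycle up.... down
rx cycle up.... down
phase:...
rx on scan: 123456
tx cycle up.... down
rx cycle up.... down"""

# Split before every line that starts with 'tx', then flatten each
# chunk by replacing its internal newlines with single spaces.
chunks = [
    " ".join(chunk.splitlines())
    for chunk in re.split(r"\n(?=tx)", inputtext)
]

for chunk in chunks:
    print(chunk)
```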

However, if you loop over the input file object (reading line by line), you can skip the regular expression and process each block as you gather its lines:

section = []
for line in open_file_object:
    if line.startswith('tx'):
        # new section
        if section:
            process_section(section)
        section = [line]
    else:
        section.append(line)
if section:
    process_section(section)
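The loop above can also be wrapped in a generator so that you can try it without a real file. The following is a minimal sketch, not the only way to do it: iter_sections is a hypothetical helper name, io.StringIO stands in for the open file, and the sample text is invented. Because str.startswith accepts a tuple, the same helper would also take several start markers, for example ('tx', 'closeloop start').

```python
import io


def iter_sections(lines, start="tx"):
    """Yield lists of lines, starting a new list at each line that begins with `start`."""
    section = []
    for line in lines:
        # A start marker closes the section gathered so far (if any).
        if line.startswith(start) and section:
            yield section
            section = []
        section.append(line.rstrip("\n"))
    # Emit the last section once the input is exhausted.
    if section:
        yield section


# io.StringIO stands in for an open file object here.
sample = io.StringIO("tx a\nrx b\nphase:...\ntx c\nrx d\n")
sections = list(iter_sections(sample))
for section in sections:
    print(section)
```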

If you need to match multiple starting lines, include each as a |-separated alternative in the look-ahead:

data = re.split(r'\n(?=tx|closeloop\sstart)', data)
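As a quick check on the closeloop sample from the question, here is a small sketch comparing that pattern with the one you tried. The variable names parts and with_group are just for illustration. re.split includes the text of any capturing group in its result, so the extra parentheses in your attempt put an additional entry between the chunks; whether that is what triggered your TypeError depends on the code that consumes the list, which I can't see.

```python
import re

data = (
    "closeloop start\n"
    "closeloop ..up:677 down:098\n"
    "closeloop start\n"
    "closeloop ..up:568 down:123"
)

# Alternatives inside a single look-ahead: nothing is captured,
# so the result holds only the chunks.
parts = re.split(r"\n(?=tx|closeloop\sstart)", data)
print(parts)

# The pattern from the question wraps the look-aheads in a capturing
# group, so re.split also returns what the group matched.
with_group = re.split(r"\n((?=tx)|(?=closeloop\sstart))", data)
print(len(parts), len(with_group))
```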

Upvotes: 8
