User2939

Reputation: 165

Include all lines in between the first and last occurrence

I have a txt file which has text in this manner:

    [2018-07-11 20:57:08] SYSTEM RESPONSE: "hello"
    [2018-07-11 20:57:19] USER INPUT (xvp_dev-0): "hi! how is it going?"
    [2018-07-11 20:57:19] SYSTEM RESPONSE: "It's going pretty good. 
     How about you?"
    [2018-07-11 14:05:20] USER INPUT (xvp_dev-0): I've been doing good too!

    Thank you.
    [2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"
    How is your day going today?
    [2018-07-12 14:05:34] USER INPUT (xvp_dev-0): "Great! Can't complain"
    [2018-07-12 14:05:34] SYSTEM RESPONSE: "Okay. 
    That's good"

Now, I want all the lines from the first occurrence of [2018-07-11] to the last, including all the lines in between. Currently, I just find the lines that start with [2018-07-11 and display them, but as you can see, a few lines in between do not start with a date stamp, and those are getting lost.

for line in file:
    if b in line:  # b = date entered by the user
        x = x + "//" + line[11:]

Sample output would be something like this. For the date 2018-07-11:

20:57:08] SYSTEM RESPONSE: "hello"
20:57:19] USER INPUT (xvp_dev-0): "hi! how is it going?"
20:57:19] SYSTEM RESPONSE: "It's going pretty good. 
How about you?"
14:05:20] USER INPUT (xvp_dev-0): I've been doing good too!
Thank you.

For the date 2018-07-12:

14:05:20] SYSTEM RESPONSE: "Hello!"
How is your day going today?
14:05:34] USER INPUT (xvp_dev-0): "Great! Can't complain"
14:05:34] SYSTEM RESPONSE: "Okay. 
That's good"

Any idea how I could get the lines in between too? Since the log is ordered by date, an earlier date cannot occur again later in the text.

Upvotes: 0

Views: 106

Answers (2)

Ajax1234

Reputation: 71451

You can use re.findall to parse the data, and then itertools.groupby:

import itertools
import re

# content is assumed to hold the entire log as a single string
dates = re.findall(r'\[.*?\]', content)
content = [re.findall(r'(?<=:)[\w\W]+', i) for i in re.sub(r'\[.*?\]', '*', content).split('*')]
final_content = [re.sub(r'\n+|\s{2,}', '', ''.join(i)) for i in content if i]
d = list(zip(dates, final_content))
# sort and group on the date part of each "[date time]" stamp
get_date = lambda x: re.findall(r'\d+-\d+-\d+', x[0])[0]
new_d = [[a, list(b)] for a, b in itertools.groupby(sorted(d, key=get_date), key=get_date)]
final_result = {a: [c for _, c in b] for a, b in new_d}

Output:

{'2018-07-12': [' "Hello!"How is your day going today?', 
                ' "Great! Can\'t complain"', 
                ' "Okay.That\'s good"'], 
 '2018-07-11': [' "hello"', 
                ' "hi! how is it going?"', 
                ' "It\'s going pretty good.How about you?"', 
                " I've been doing good too!Thank you."]}

Now, all the responses found for each date are contained in a list as a value in a dictionary with the date itself as the key.
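
As a usage sketch, a dictionary of this shape can be walked in date order to print one block per day. The helper name format_by_date is made up for this illustration, and the sample dictionary is copied from the output above.

```python
def format_by_date(result):
    """Return one text block per date, with dates in ascending order."""
    blocks = []
    for date in sorted(result):  # ISO dates sort correctly as plain strings
        lines = [f'for the date {date}:']
        # Each message keeps a leading space from the regex, so strip it.
        lines.extend(message.strip() for message in result[date])
        blocks.append('\n'.join(lines))
    return '\n\n'.join(blocks)


# Sample copied from the output shown above.
final_result = {
    '2018-07-12': [' "Hello!"How is your day going today?',
                   ' "Great! Can\'t complain"',
                   ' "Okay.That\'s good"'],
    '2018-07-11': [' "hello"',
                   ' "hi! how is it going?"',
                   ' "It\'s going pretty good.How about you?"',
                   " I've been doing good too!Thank you."],
}

print(format_by_date(final_result))
```

Note that the timestamps and speaker labels are already gone at this point, since the parsing step keeps only the text after the first colon of each entry.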

Upvotes: 0

Andrej Kesely

Reputation: 195438

You can use regular expressions to parse the lines. I made a function find_lines_by_date() where you can supply the date string and it will return a list of lines with this date:

data = """
    [2018-07-11 20:57:08] SYSTEM RESPONSE: "hello"
    [2018-07-11 20:57:19] USER INPUT (xvp_dev-0): "hi! how is it going?"
    [2018-07-11 20:57:19] SYSTEM RESPONSE: "It's going pretty good.
     How about you?"
    [2018-07-11 14:05:20] USER INPUT (xvp_dev-0): I've been doing good too!

    Thank you.
    [2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"
    How is your day going today?
    [2018-07-12 14:05:34] USER INPUT (xvp_dev-0): "Great! Can't complain"
    [2018-07-12 14:05:34] SYSTEM RESPONSE: "Okay.
    That's good"
"""

import re
import pprint

def find_lines_by_date(date='2018-07-11'):
    rv = []
    groups = re.findall(r'(\[(.*?)\s+.*?\][^\[]+)', data)
    for g in groups:
        if g[-1] == date:
            rv.append(g[0].strip())
    return rv


pprint.pprint(find_lines_by_date(date='2018-07-12'))

This will print:

['[2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"\n'
 '    How is your day going today?',
 '[2018-07-12 14:05:34] USER INPUT (xvp_dev-0): "Great! Can\'t complain"',
 '[2018-07-12 14:05:34] SYSTEM RESPONSE: "Okay.\n    That\'s good"']
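
For comparison, the same grouping can be done without one large regex by walking the text line by line and carrying the most recent date forward, so that continuation lines inherit the date of the entry above them. This is a different technique from find_lines_by_date(), shown only as a sketch: the name group_by_date and the shortened sample are made up for illustration, and it assumes every new entry starts with a [YYYY-MM-DD HH:MM:SS] stamp.

```python
import re
from collections import defaultdict

# A new entry starts with "[YYYY-MM-DD " followed by the time.
STAMP = re.compile(r'^\[(\d{4}-\d{2}-\d{2})\s+')


def group_by_date(text):
    """Map each date to its lines, including unstamped continuation lines."""
    grouped = defaultdict(list)
    current = None  # date of the most recent stamped line
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = STAMP.match(line)
        if match:
            current = match.group(1)
            line = line[match.end():]  # drop "[date ", keep "time] ..."
        if current is not None:  # ignore any text before the first stamp
            grouped[current].append(line)
    return dict(grouped)


# Shortened sample, made up for illustration.
sample = '''
[2018-07-11 20:57:08] SYSTEM RESPONSE: "hello"
[2018-07-11 14:05:20] USER INPUT (xvp_dev-0): I've been doing good too!

Thank you.
[2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"
How is your day going today?
'''

for date, lines in group_by_date(sample).items():
    print(date, lines)
```

This keeps the time and drops the date from each stamped line, which is close to the output format asked for in the question.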

EDIT:

The regexp (\[(.*?)\s+.*?\][^\[]+) captures two groups for every entry, so re.findall returns a list of two-item tuples: the first item is the whole entry (the stamp plus all text up to the next [), which is used for the return value, and the second item is the date, which is used for the comparison.
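
A short sketch of what re.findall returns for this pattern, on a two-entry sample made up for illustration. Each match is a tuple of (whole entry, date).

```python
import re

pattern = r'(\[(.*?)\s+.*?\][^\[]+)'

# Two-entry sample made up for illustration.
sample = (
    '[2018-07-11 20:57:08] A: "hello"\n'
    '[2018-07-12 14:05:20] B: "Hello!"\n'
    'How is your day?\n'
)

matches = re.findall(pattern, sample)
for entry, date in matches:
    # date is the inner group, entry is the outer group
    print(repr(date), repr(entry.strip()))
```

Because [^\[]+ stops at the next [, a message that itself contains a [ character would be cut short at that point.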

I made a simple example on an external site with a detailed explanation:

Upvotes: 5
