Reputation: 165
I have a txt file which has text in this manner:
[2018-07-11 20:57:08] SYSTEM RESPONSE: "hello"
[2018-07-11 20:57:19] USER INPUT (xvp_dev-0): "hi! how is it going?"
[2018-07-11 20:57:19] SYSTEM RESPONSE: "It's going pretty good.
How about you?"
[2018-07-11 14:05:20] USER INPUT (xvp_dev-0): I've been doing good too!
Thank you.
[2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"
How is your day going today?
[2018-07-12 14:05:34] USER INPUT (xvp_dev-0): "Great! Can't complain"
[2018-07-12 14:05:34] SYSTEM RESPONSE: "Okay.
That's good"
Now, I want all the lines from the first occurrence of [2018-07-11] to the last, and all the lines in between. Currently, I am just finding the lines that start with [2018-07-11 and displaying them, but if you notice, there are a few lines in between them that are getting lost.
for line in file:
    if b in line:  # b = the date string being searched for
        x = x + "//" + line[11:]
    else:
        x = x
Sample output would be something like: For the date 2018-07-11:
20:57:08] SYSTEM RESPONSE: "hello"
20:57:19] USER INPUT (xvp_dev-0): "hi! how is it going?"
20:57:19] SYSTEM RESPONSE: "It's going pretty good.
How about you?"
14:05:20] USER INPUT (xvp_dev-0): I've been doing good too!
Thank you.
For the date 2018-07-12:
14:05:20] SYSTEM RESPONSE: "Hello!"
How is your day going today?
14:05:34] USER INPUT (xvp_dev-0): "Great! Can't complain"
14:05:34] SYSTEM RESPONSE: "Okay.
That's good"
Any idea on how I would be able to get the lines in between too? Since it all depends on dates, there is no way an occurrence of an earlier date can happen later on in the text.
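One way to use that observation is to remember the date of the most recent timestamped line and attach any untimestamped line to it (a minimal sketch, assuming file is the open text file and the lines look like the sample above):
groups = {}            # date -> list of output lines
current_date = None
for line in file:
    if line.startswith('['):        # timestamped line: starts a new entry
        current_date = line[1:11]   # e.g. "2018-07-11"
        groups.setdefault(current_date, []).append(line[12:].rstrip())
    elif current_date is not None:  # continuation line: belongs to the previous entry
        groups[current_date].append(line.rstrip())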
Upvotes: 0
Views: 106
Reputation: 71451
You can use re.findall to parse the data, and then itertools.groupby:
import re
import itertools

# content is assumed to hold the whole log file as one string,
# e.g. content = open('log.txt').read()
dates = re.findall(r'\[.*?\]', content)
content = [re.findall(r'(?<=:)[\w\W]+', i) for i in re.sub(r'\[.*?\]', '*', content).split('*')]
final_content = [re.sub(r'\n+|\s{2,}', '', ''.join(i)) for i in content if i]
d = list(zip(dates, final_content))
new_d = [[a, list(b)] for a, b in
         itertools.groupby(sorted(d, key=lambda x: re.findall(r'\d+-\d+-\d+', x[0])[0]),
                           key=lambda x: re.findall(r'\d+-\d+-\d+', x[0])[0])]
final_result = {a: [c for _, c in b] for a, b in new_d}
Output:
{'2018-07-12': [' "Hello!"How is your day going today?',
' "Great! Can\'t complain"',
' "Okay.That\'s good"'],
'2018-07-11': [' "hello"',
' "hi! how is it going?"',
' "It\'s going pretty good.How about you?"',
" I've been doing good too!Thank you."]}
Now, all the responses found for each date are contained in a list as a value in a dictionary with the date itself as the key.
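For example, to pull out a single day's conversation afterwards (a minimal usage sketch, assuming final_result was built as above):
print(final_result['2018-07-11'])
# [' "hello"', ' "hi! how is it going?"', ' "It\'s going pretty good.How about you?"', " I've been doing good too!Thank you."]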
Upvotes: 0
Reputation: 195438
You can use regular expressions to parse the lines. I made a function find_lines_by_date() where you can supply a date string and it will return a list of lines with that date:
data = """
[2018-07-11 20:57:08] SYSTEM RESPONSE: "hello"
[2018-07-11 20:57:19] USER INPUT (xvp_dev-0): "hi! how is it going?"
[2018-07-11 20:57:19] SYSTEM RESPONSE: "It's going pretty good.
How about you?"
[2018-07-11 14:05:20] USER INPUT (xvp_dev-0): I've been doing good too!
Thank you.
[2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"
How is your day going today?
[2018-07-12 14:05:34] USER INPUT (xvp_dev-0): "Great! Can't complain"
[2018-07-12 14:05:34] SYSTEM RESPONSE: "Okay.
That's good"
"""
import re
import pprint

def find_lines_by_date(date='2018-07-11'):
    rv = []
    # each match is a tuple: (whole entry including its continuation lines, date)
    groups = re.findall(r'(\[(.*?)\s+.*?\][^\[]+)', data)
    for g in groups:
        if g[-1] == date:
            rv.append(g[0].strip())
    return rv

pprint.pprint(find_lines_by_date(date='2018-07-12'))
This will print:
['[2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"\n'
' How is your day going today?',
'[2018-07-12 14:05:34] USER INPUT (xvp_dev-0): "Great! Can\'t complain"',
'[2018-07-12 14:05:34] SYSTEM RESPONSE: "Okay.\n That\'s good"']
EDIT:
The regexp (\[(.*?)\s+.*?\][^\[]+) matches each entry as a two-valued group: the first value holds the whole entry (the text used for the return value), and the second value is the date (used for the comparison).
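For instance, running it on a single entry shows the two values per match (a small illustrative snippet, reusing the regexp above):
import re

sample = '[2018-07-12 14:05:34] USER INPUT (xvp_dev-0): "Great! Can\'t complain"\n'
print(re.findall(r'(\[(.*?)\s+.*?\][^\[]+)', sample))
# [('[2018-07-12 14:05:34] USER INPUT (xvp_dev-0): "Great! Can\'t complain"\n', '2018-07-12')]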
I made a simple example on an external site with a detailed explanation:
Upvotes: 5