Reputation: 5971
I have a text file that I want to read into a list, in a certain format.
When I write:
with open('chat_history.txt', encoding='utf8') as f:
    mylist = [line.rstrip('\n') for line in f]
I get:
27/08/15, 15:45 - text
continue text
continue text 2
27/08/15, 16:10 - new text
new text 2
new text 3
27/08/15, 19:55 - more text
I'd like to get:
27/08/15, 15:45 - text continue text continue text 2
27/08/15, 16:10 - new text new text 2 new text 3
27/08/15, 19:55 - more text
I want to start a new list item only at lines that begin with the format DD/MM/YY, HH:MM - (that is, split only on \nDD/MM/YY, HH:MM -).
Unfortunately I'm not an expert at regular expressions. I tried:
with open('chat_history.txt', encoding='utf8') as f:
    mylist = [line.rstrip('\n'r'[\d\d/\d\d/\d\d - ]') for line in f]
This gave the same result. On second thought, it makes sense that it doesn't work. I'd appreciate some help, though.
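For reference, a small sketch of why the attempt above changes nothing for these lines: str.rstrip treats its argument as a set of characters to strip, not as a regular expression. The second sample string ('the end') is made up for illustration and is not from the chat file.

```python
# The two adjacent literals are concatenated into one plain string.
chars = '\n' r'[\d\d/\d\d/\d\d - ]'

# rstrip() removes any trailing characters found in that set; it does not
# interpret the argument as a pattern.
dated = '27/08/15, 15:45 - text\n'.rstrip(chars)
ends_in_d = 'the end\n'.rstrip(chars)  # hypothetical line ending in 'd'

print(repr(dated))      # '27/08/15, 15:45 - text'
print(repr(ends_in_d))  # 'the en'
```

So the lines are never merged, and a line that happens to end in one of those characters even loses it.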
Upvotes: 4
Views: 5645
Reputation: 356
My solution uses a simpler regex than Jan's. The code using the regex is slightly more verbose though.
First, the input file:
$ cat -e chat_history.txt
27/08/15, 15:45 - text$
continue text$
continue text 2$
27/08/15, 16:10 - new text$
new text 2$
new text 3$
27/08/15, 19:55 - more text$
The code:
import re
date_time_regex = re.compile(r'^\d{2}/\d{2}/\d{2}, \d{2}:\d{2} - .*')
with open('chat_history.txt', encoding='utf8') as f:
    first_date = True
    for line in f:
        line = line.rstrip('\n')
        if date_time_regex.match(line):
            if not first_date:
                # Print a newline character before printing a date
                # if it is not the first date.
                print()
            else:
                first_date = False
        else:
            # Print a separator, without a newline character.
            print(' ', end='')
        # Print the original line, without a newline character.
        print(line, end='')
    # Print the last newline character.
    print()
Running the code (cat -e shows that there are no trailing spaces):
$ python3 chat.py | cat -e
27/08/15, 15:45 - text continue text continue text 2$
27/08/15, 16:10 - new text new text 2 new text 3$
27/08/15, 19:55 - more text$
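Since the question asks for a list rather than printed output, the same line-by-line idea can probably be adapted as follows. This is only a sketch: the function name merge_entries is made up for illustration, and it assumes every entry starts with the DD/MM/YY, HH:MM - prefix.

```python
import re

# Same date/time prefix as in the code above.
DATE_TIME_REGEX = re.compile(r'^\d{2}/\d{2}/\d{2}, \d{2}:\d{2} - ')


def merge_entries(lines):
    """Join continuation lines onto the preceding dated line.

    merge_entries is a hypothetical helper name, not part of any library.
    """
    entries = []
    for line in lines:
        line = line.rstrip('\n')
        if DATE_TIME_REGEX.match(line) or not entries:
            # A dated line (or the very first line) starts a new entry.
            entries.append(line)
        else:
            # Anything else is appended to the current entry.
            entries[-1] += ' ' + line
    return entries


sample = [
    '27/08/15, 15:45 - text\n',
    'continue text\n',
    'continue text 2\n',
    '27/08/15, 16:10 - new text\n',
    'new text 2\n',
]
mylist = merge_entries(sample)
print(mylist)
```

With the real file, passing the open file object (merge_entries(f) inside the with block) should work the same way, since iterating over a file yields its lines.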
Upvotes: 2
Reputation: 2167
with open('chat_history.txt', encoding='utf8') as f:
    l = [line.rstrip('\n').replace('\n', ' ') for line in f]
print(l)
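Note that iterating over the file yields one line at a time, so after rstrip('\n') there is no newline left for replace to act on. A possible variation on the replace idea, as a sketch only: work on the whole text and replace just those newlines that are not followed by a date. The name join_continuations and the exact prefix pattern are assumptions for illustration.

```python
import re

# Assumed format: a new entry starts with "DD/MM/YY, HH:MM - ".
# The negative lookahead matches only newlines NOT followed by that prefix.
CONTINUATION = re.compile(r'\n(?!\d{2}/\d{2}/\d{2}, \d{2}:\d{2} - )')


def join_continuations(text):
    """Hypothetical helper: turn continuation newlines into spaces."""
    text = text.rstrip('\n')  # drop the final newline so it isn't replaced
    return CONTINUATION.sub(' ', text).split('\n')


text = (
    '27/08/15, 15:45 - text\n'
    'continue text\n'
    'continue text 2\n'
    '27/08/15, 16:10 - new text\n'
    'new text 2\n'
)
l = join_continuations(text)
print(l)
```

For the real file, text would come from f.read() instead of the inline sample.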
Upvotes: 2
Reputation: 43169
Admittedly, this might be way over the top, and I'm sure there are other ways to accomplish the same thing. I'd like to present my solution here with (?(DEFINE)...) using the newer regex module. First the code, then the explanation:
import regex as re
string = """
27/08/15, 15:45 - text
continue text
continue text 2
27/08/15, 16:10 - new text
new text 2
new text 3
27/08/15, 19:55 - more text
"""
rx = re.compile(r'''
(?(DEFINE)
(?P<date>\d{2}/\d{2}/\d{2},\ \d{2}:\d{2}) # the date format
)
^ # anchor, start of the line
(?&date) # the previously defined format
(?:(?!^(?&date)).)+ # "not date" as long as possible
''', re.M | re.X | re.S)
entries = (m.group(0).replace('\n', ' ') for m in rx.finditer(string))
for entry in entries:
    print(entry)
This yields:
27/08/15, 15:45 - text continue text continue text 2
27/08/15, 16:10 - new text new text 2 new text 3
27/08/15, 19:55 - more text
Explanation: the expression looks for blocks of the form
date
text1
text2
date
text3
date
text
... and puts them together like
date text1 text2
date text3
date text
The "date format" is defined in the date group; after that, the structure is as follows:
date "match as long as there's no date in the next line"
This is achieved via a negative lookahead. Afterwards, all newlines found are replaced with a space (in the generator expression, that is).
Obviously, one could get the same result without the regex module and the (?(DEFINE) block, but we'd have to repeat ourselves in the matching and the lookahead.
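A rough sketch of that standard-library variant, for comparison. The date pattern appears twice in the compiled expression; a Python string variable (DATE, a name chosen here for illustration) stands in for the DEFINE block. The helper split_entries and the extra strip() call are likewise my own additions, not part of the code above.

```python
import re

# The date format, kept in a plain string so it can be reused below.
# The space is escaped because the pattern is compiled in verbose mode.
DATE = r'\d{2}/\d{2}/\d{2},\ \d{2}:\d{2}'

rx = re.compile(rf'''
    ^{DATE}              # a date at the start of a line
    (?:(?!^{DATE}).)+    # anything up to the next line starting with a date
''', re.M | re.X | re.S)


def split_entries(text):
    """Hypothetical helper: one string per dated entry, newlines as spaces."""
    return [m.group(0).replace('\n', ' ').strip() for m in rx.finditer(text)]


string = """
27/08/15, 15:45 - text
continue text
continue text 2
27/08/15, 16:10 - new text
new text 2
new text 3
27/08/15, 19:55 - more text
"""

for entry in split_entries(string):
    print(entry)
```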
Finally, see the demo on regex101.com for the expression.
Upvotes: 2