Reputation: 656
I have a text file of this form:
06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde
fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde
fghe
ijkl
07/01/2016, 7:58 pm - abcde
You can see that every line is separated by a line break, but some row contents have line breaks in them. So, simply separating by line doesn't parse every line properly.
As an example, for the 5th entry, I want my output to be
07/01/2016, 6:14 pm - abcde fghe
Here is my current code:
with open('file.txt', 'r') as text_file:
data = []
for line in text_file:
row = line.strip()
data.append(row)
Upvotes: 3
Views: 1043
Reputation: 103844
Given your example input, you can use a regex with a forward lookahead:
pat=re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)
with open (fn) as f:
pprint([m.group(1) for m in pat.finditer(f.read())])
Prints:
['06/01/2016, 10:40 pm - abcde\n',
'07/01/2016, 12:04 pm - abcde\n',
'07/01/2016, 12:05 pm - abcde\n',
'07/01/2016, 12:05 pm - abcde\n',
'07/01/2016, 6:14 pm - abcde\n\nfghe\n',
'07/01/2016, 6:20 pm - abcde\n',
'07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl\n',
'07/01/2016, 7:58 pm - abcde\n']
With the Dropbox example, prints:
['11/11/2015, 3:16 pm - IK: 12\n',
'13/11/2015, 12:10 pm - IK: Hi.\n\nBut this is not about me.\n\nA donation, however small, will go a long way.\n\nThank you.\n',
'13/11/2015, 12:11 pm - IK: Boo\n',
'15/11/2015, 8:36 pm - IR: Root\n',
'15/11/2015, 8:36 pm - IR: LaTeX?\n',
'15/11/2015, 8:43 pm - IK: Ws\n']
If you want to delete the \n
in what is captured, just add m.group(1).strip().replace('\n', '')
to the list comprehension above.
Explanation of regex:
^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)
^ start of line
^ ^ ^ ^ ^ pattern for a date
^ capture the rest...
^ until (look ahead)
^ ^ ^ another date
^ or
^ end of string
Upvotes: 1
Reputation: 6319
Simple testing for length:
#!python3
#coding=utf-8
data = """06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde
fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde
fghe
ijkl
07/01/2016, 7:58 pm - abcde"""
lines = data.split("\n")
out = []
for l in lines:
c = l.strip()
if c:
if len(c) < 10:
out[-1] += c
else:
out.append(c)
#skip empty
for o in out:
print(o)
results in:
06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcdefghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcdefgheijkl
07/01/2016, 7:58 pm - abcde
Does not contain the line breaks in the data!
But this one liner regular expression should do it (split on linebreak followed by digit), at least for the sample data (breaks when data contains linebreak followed by digit):
#!python3
#coding=utf-8
text_file = """06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde
fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde
fghe
ijkl
07/01/2016, 7:58 pm - abcde"""
import re
data = re.split("\n(?=\d)", text_file)
print(data)
for d in data:
print(d)
Output:
['06/01/2016, 10:40 pm - abcde', '07/01/2016, 12:04 pm - abcde', '07/01/2016, 12:05 pm - abcde', '07/01/2016, 12:05 pm - abcde', '07/01/2016, 6:14 pm - abcde\n\
nfghe', '07/01/2016, 6:20 pm - abcde', '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl', '07/01/2016, 7:58 pm - abcde']
06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde
fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde
fghe
ijkl
07/01/2016, 7:58 pm - abcde
(fixed with lookahead)
Upvotes: 0
Reputation: 1323
Considering that ','
can only appear as a separator, we may check if the line has a comma and concatenate it to the last row if it doesn't:
data = []
with open('file.txt', 'r') as text_file:
for line in text_file:
row = line.strip()
if ',' not in row:
data[-1] += '\n' + row
else:
data.append(row)
Upvotes: 1
Reputation: 73460
You could use regular expressions (using the re
module) to check for dates like this:
import re
with open('file.txt', 'r') as text_file:
data = []
for line in text_file:
row = line.strip()
if re.match(r'\d{2}/\d{2}/\d{4}.*'):
data.append(row) # date: new record
else:
data[-1] += '\n' + row # no date: append to last record
# '\d{2}': two digits
# '.*': any character, zero or more times
Upvotes: 0