Imran

Reputation: 656

How to read a text file where some of the contents have line breaks?

I have a text file of this form:

06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde

You can see that every record starts on a new line, but some records contain line breaks themselves. So simply splitting on line breaks doesn't parse every record properly.

As an example, for the 5th entry, I want my output to be 07/01/2016, 6:14 pm - abcde fghe

Here is my current code:

with open('file.txt', 'r') as text_file:
    data = []
    for line in text_file:
        row = line.strip()
        data.append(row)

Upvotes: 3

Views: 1043

Answers (4)

dawg

Reputation: 103844

Given your example input, you can use a regex with a lookahead:

import re
from pprint import pprint

pat=re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)

with open(fn) as f:  # fn is the path to your text file
    pprint([m.group(1) for m in pat.finditer(f.read())])

Prints:

['06/01/2016, 10:40 pm - abcde\n',
 '07/01/2016, 12:04 pm - abcde\n',
 '07/01/2016, 12:05 pm - abcde\n',
 '07/01/2016, 12:05 pm - abcde\n',
 '07/01/2016, 6:14 pm - abcde\n\nfghe\n',
 '07/01/2016, 6:20 pm - abcde\n',
 '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl\n',
 '07/01/2016, 7:58 pm - abcde\n']

With the Dropbox example, prints:

['11/11/2015, 3:16 pm - IK: 12\n',
 '13/11/2015, 12:10 pm - IK: Hi.\n\nBut this is not about me.\n\nA donation, however small, will go a long way.\n\nThank you.\n',
 '13/11/2015, 12:11 pm - IK: Boo\n',
 '15/11/2015, 8:36 pm - IR: Root\n',
 '15/11/2015, 8:36 pm - IR: LaTeX?\n',
 '15/11/2015, 8:43 pm - IK: Ws\n']

If you want to delete the \n in what is captured, use m.group(1).strip().replace('\n', '') in place of m.group(1) in the list comprehension above.
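
Note that replace('\n', '') joins the pieces with no separator, while the question asks for a space between them. A minimal sketch of one way to get that, assuming the same date format as the sample; the `flatten` helper and the `sample` string are made up for illustration:

```python
import re

# Same idea as the pattern above: a record runs from one date to the next date (or end of string).
pat = re.compile(r'^(\d\d/\d\d/\d\d\d\d.*?)(?=^\d\d/\d\d/\d\d\d\d|\Z)', re.S | re.M)

def flatten(record):
    # str.split() with no arguments splits on any run of whitespace,
    # including blank lines, so joining with ' ' leaves single spaces.
    return ' '.join(record.split())

# Hypothetical two-record sample in the question's format.
sample = "07/01/2016, 6:14 pm - abcde\n\nfghe\n07/01/2016, 6:20 pm - abcde\n"

records = [flatten(m.group(1)) for m in pat.finditer(sample)]
print(records)
```

This also collapses any repeated spaces inside a message, which may or may not be what you want.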


Explanation of regex:

^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)

^                                                       start of line   
    ^  ^  ^  ^   ^                                      pattern for a date  
                       ^                                capture the rest...  
                           ^                            until (look ahead)
                                      ^ ^ ^             another date
                                                  ^     or
                                                     ^  end of string

Upvotes: 1

handle

Reputation: 6319

Simple testing for length:

#!python3
#coding=utf-8

data = """06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde"""

lines = data.split("\n")
out = []
for l in lines:
    c = l.strip()
    if c:
        if len(c) < 10:
            out[-1] += c
        else:
            out.append(c)
    #skip empty

for o in out:
    print(o)

results in:

06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcdefghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcdefgheijkl
07/01/2016, 7:58 pm - abcde

Does not contain the line breaks in the data!


But this one-liner regular expression should do it (split on a line break followed by a digit), at least for the sample data (it breaks when a record's content contains a line break followed by a digit):

#!python3
#coding=utf-8

text_file = """06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde"""

import re
data = re.split(r"\n(?=\d)", text_file)  # raw string avoids the invalid-escape warning for \d

print(data)

for d in data:
    print(d)

Output:

['06/01/2016, 10:40 pm - abcde', '07/01/2016, 12:04 pm - abcde', '07/01/2016, 12:05 pm - abcde', '07/01/2016, 12:05 pm - abcde', '07/01/2016, 6:14 pm - abcde\n\nfghe', '07/01/2016, 6:20 pm - abcde', '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl', '07/01/2016, 7:58 pm - abcde']
06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde

(fixed with lookahead)
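
The caveat about a line break followed by a digit can be reduced by making the lookahead require the whole date prefix. This is only a sketch, assuming every record starts with `dd/mm/yyyy, ` as in the sample; the `text` string below is invented to show the difference:

```python
import re

# Hypothetical sample: the continuation line "2 more lines" starts with a digit.
text = "07/01/2016, 6:14 pm - abcde\n2 more lines\n07/01/2016, 6:20 pm - abcde"

# Splits before any digit, so the continuation line is wrongly treated as a new record.
loose = re.split(r"\n(?=\d)", text)

# Splits only before a full date followed by ", ".
strict = re.split(r"\n(?=\d{2}/\d{2}/\d{4}, )", text)

print(loose)
print(strict)
```

A message whose continuation line itself begins with a date in that exact format would still be split, so this narrows the problem rather than removing it.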

Upvotes: 0

hugos

Reputation: 1323

Assuming that ',' appears only as the separator after the date, we can check whether each line contains a comma and, if it doesn't, concatenate it to the last row:

data = []

with open('file.txt', 'r') as text_file:
    for line in text_file:
        row = line.strip()
        if ',' not in row:
            data[-1] += '\n' + row
        else:
            data.append(row)

Upvotes: 1

user2390182

Reputation: 73460

You could use regular expressions (using the re module) to check for dates like this:

import re
with open('file.txt', 'r') as text_file:
  data = []
  for line in text_file:
    row = line.strip()
    if re.match(r'\d{2}/\d{2}/\d{4}.*', row):  # re.match needs the string to test
      data.append(row)  # date: new record
    else:
      data[-1] += '\n' + row  # no date: append to last record

# '\d{2}': two digits
# '.*': any character, zero or more times
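
The loop above raises an IndexError on data[-1] if the file starts with a line that has no date, and it keeps the blank lines. Here is a rough, self-contained variant that guards against both and joins with a space as the question asks. The names `DATE_RE` and `group_records` and the sample lines are my own, not part of the original code:

```python
import re

# Assumed record prefix, based on the sample: dd/mm/yyyy followed by ", ".
DATE_RE = re.compile(r'\d{2}/\d{2}/\d{4}, ')

def group_records(lines):
    records = []
    for line in lines:
        row = line.strip()
        if not row:
            continue  # skip blank lines inside a record
        if DATE_RE.match(row) or not records:
            # A date starts a new record; so does a leading line without one.
            records.append(row)
        else:
            # No date: continuation of the previous record.
            records[-1] += ' ' + row
    return records

# Any iterable of lines works, including an open file object.
sample = ["07/01/2016, 6:14 pm - abcde\n", "\n", "fghe\n", "07/01/2016, 6:20 pm - abcde\n"]
print(group_records(sample))
```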

Upvotes: 0
