sheldonzy

Reputation: 5971

Python read lines from file by regex

I have a text file that I want to read into a list, in a certain format.

When I'm writing:

with open('chat_history.txt', encoding='utf8') as f:
    mylist = [line.rstrip('\n') for line in f]

I'm getting:

27/08/15, 15:45 - text
continue text
continue text 2
27/08/15, 16:10 - new text
new text 2
new text 3
27/08/15, 19:55 - more text

I'd like to get:

27/08/15, 15:45 - text continue text continue text 2
27/08/15, 16:10 - new text new text 2 new text 3
27/08/15, 19:55 - more text

I want to start a new list item only at lines that begin with the format DD/MM/YY, HH:MM - (that is, right after a \n). Unfortunately, I'm not an expert at regular expressions. I tried:

with open('chat_history.txt', encoding='utf8') as f:
    mylist = [line.rstrip('\n'r'[\d\d/\d\d/\d\d - ]') for line in f]

Which gave the same result. On second thought, it makes sense that it doesn't work. Would love some help though.

Upvotes: 4

Views: 5645

Answers (3)

Rohan Talip

Reputation: 356

My solution uses a simpler regex than Jan's. The code using the regex is slightly more verbose though.

First, the input file:

$ cat -e chat_history.txt
27/08/15, 15:45 - text$
continue text$
continue text 2$
27/08/15, 16:10 - new text$
new text 2$
new text 3$
27/08/15, 19:55 - more text$

The code:

import re

date_time_regex = re.compile(r'^\d{2}/\d{2}/\d{2}, \d{2}:\d{2} - .*')

with open('chat_history.txt', encoding='utf8') as f:
    first_date = True
    for line in f:
        line = line.rstrip('\n')

        if date_time_regex.match(line):
            if not first_date:
                # Print a newline character before printing a date
                # if it is not the first date.
                print()
            else:
                first_date = False
        else:
            # Print a separator, without a newline character.
            print(' ', end='')

        # Print the original line, without a newline character.
        print(line, end='')

# Print the last newline character.
print()

Running the code (cat -e shows that there are no trailing spaces):

$ python3 chat.py | cat -e
27/08/15, 15:45 - text continue text continue text 2$
27/08/15, 16:10 - new text new text 2 new text 3$
27/08/15, 19:55 - more text$
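
Since the question asks for a list rather than printed output, the same idea can be used to build one. This is only a minimal sketch: the function name merge_entries and the inline sample list are assumptions made for illustration, and the regex drops the ^ and .* because match() already anchors at the start of the line.

```python
import re

# Same date/time prefix as above; match() only checks the start of the line.
DATE_TIME_REGEX = re.compile(r'\d{2}/\d{2}/\d{2}, \d{2}:\d{2} - ')


def merge_entries(lines):
    """Join continuation lines onto the preceding dated line (hypothetical helper)."""
    entries = []
    for line in lines:
        line = line.rstrip('\n')
        if DATE_TIME_REGEX.match(line) or not entries:
            # A dated line (or an undated first line) starts a new entry.
            entries.append(line)
        else:
            # Anything else continues the current entry.
            entries[-1] += ' ' + line
    return entries


# Sample input standing in for the lines of chat_history.txt.
sample = [
    '27/08/15, 15:45 - text\n',
    'continue text\n',
    'continue text 2\n',
    '27/08/15, 16:10 - new text\n',
    'new text 2\n',
    'new text 3\n',
    '27/08/15, 19:55 - more text\n',
]

mylist = merge_entries(sample)
print(mylist)
```

With the real file, the open file object can be passed directly, e.g. mylist = merge_entries(f) inside the with block.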

Upvotes: 2

Van Peer

Reputation: 2167

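import re

with open('chat_history.txt', encoding='utf8') as f:
    # Split before each line that starts with a date (empty-match splits need Python 3.7+).
    parts = re.split(r'(?m)^(?=\d{2}/\d{2}/\d{2}, \d{2}:\d{2} - )', f.read())
    l = [p.rstrip('\n').replace('\n', ' ') for p in parts if p]

print(l)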

Upvotes: 2

Jan

Reputation: 43169

Admittedly, this might be way over the top, and I'm sure there are other ways to accomplish the same thing. I'd like to present my solution here, which uses (?(DEFINE)...) with the newer regex module. First the code, then the explanation:

import regex as re

string = """
27/08/15, 15:45 - text
continue text
continue text 2
27/08/15, 16:10 - new text
new text 2
new text 3
27/08/15, 19:55 - more text
"""

rx = re.compile(r'''
    (?(DEFINE)
        (?P<date>\d{2}/\d{2}/\d{2},\ \d{2}:\d{2}) # the date format
    )
    ^                    # anchor, start of the line
    (?&date)             # the previously defined format
    (?:(?!^(?&date)).)+  # "not date" as long as possible
''', re.M | re.X | re.S)


entries = (m.group(0).replace('\n', ' ') for m in rx.finditer(string))
for entry in entries:
    print(entry)

This yields:

27/08/15, 15:45 - text continue text continue text 2 
27/08/15, 16:10 - new text new text 2 new text 3 
27/08/15, 19:55 - more text 


Basically this approach looks for date blocks, separated by text in between:

date
text1
text2
date
text3
date
text

... and puts them together like

date text1 text2
date text3
date text

The "date format" is defined in the date group; after that, the structure is as follows:

date "match as long as there's no date in the next line"

This is achieved via a negative lookahead. Afterwards, all newlines in each match are replaced with a space (in the generator expression, that is).
Obviously, one could get the same result without the regex module and the (?(DEFINE)...) block, but we'd have to repeat the date pattern in both the match and the lookahead.
Finally, see a demo on regex101.com for the expression.
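
For comparison, here is a rough sketch of that repetition using only the standard re module. The constant name DATE is an assumption made for illustration; keeping the pattern in a Python string avoids typing it twice, although it still appears twice in the compiled expression. Unlike the code above, this sketch strips the trailing newline of each match before joining, so the entries carry no trailing space.

```python
import re

# The date format, written once and interpolated twice below.
DATE = r'\d{2}/\d{2}/\d{2}, \d{2}:\d{2}'

rx = re.compile(
    r'^' + DATE                  # a date at the start of a line
    + r'(?:(?!^' + DATE + r').)+',  # then anything up to the next such line
    re.M | re.S,
)

string = """
27/08/15, 15:45 - text
continue text
continue text 2
27/08/15, 16:10 - new text
new text 2
new text 3
27/08/15, 19:55 - more text
"""

# Drop the trailing newline of each block, then join the remaining lines.
entries = [m.group(0).rstrip('\n').replace('\n', ' ')
           for m in rx.finditer(string)]
for entry in entries:
    print(entry)
```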

Upvotes: 2
