Renaud

Reputation: 2819

Regex pattern in Python

I'm looking for a regex pattern in Python to do the following:

For a text formatted like:

2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!

I would like to return:

[(2021-01-01,10:00:05,Surname1 Name1,Comment,Blablabla\nBlabla),
(2021-01-01,23:00:05,Surname2 SurnameBis Name2,WorkNotes,What?\nI don't know?),
(2021-01-02,03:00:05,Surname1 Name1,Comment,Blablabla!)]

I managed to get quite close with:

import re

text2 = """2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
Can you be clear?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!"""
LangTag = re.findall(r"(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})\s-\s(.*?)\((.*)\)\n(.*)(?:\n|$)", text2)
print(LangTag)

But I'm stuck on capturing the entire message text: only the first line of each message is returned.

One solution would be to remove the \n characters from the initial text, but I would like to avoid that because I need them later on. Any ideas?

Upvotes: 1

Views: 140

Answers (5)

The fourth bird

Reputation: 163207

The message lines after the Comment/WorkNotes header can be matched as all lines that do not start with a date-like pattern. To do that, use a negative lookahead and change the last part of the pattern to:

((?:(?!\d{4}-\d{2}-\d{2}).*(?:\n|$))*)

For example

import re
 
text2 = """2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
Can you be clear?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!"""
LangTag = re.findall(r"(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})\s-\s(.*?)\((.*)\)\n((?:(?!\d{4}-\d{2}-\d{2}).*(?:\n|$))*)", text2)
print(LangTag)

Output

[
('2021-01-01', '10:00:05', 'Surname1 Name1 ', 'Comment', 'Blablabla\nBlabla\n'), 
('2021-01-01', '23:00:05', 'Surname2 SurnameBis Name2 ', 'WorkNotes', "What?\nI don't know?\nCan you be clear?\n"), 
('2021-01-02', '03:00:05', 'Surname1 Name1 ', 'Comment', 'Blablabla!')
]
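
The output above keeps a trailing space after each name and a trailing newline after most messages, which the expected result in the question does not have. A minimal sketch of one way to tidy this up, assuming it is acceptable to strip leading and trailing whitespace from every field (the names `pattern` and `cleaned` are illustrative only):

```python
import re

text2 = """2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
Can you be clear?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!"""

# Same pattern as above, split over two string literals for readability.
pattern = re.compile(
    r"(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})\s-\s(.*?)\((.*)\)\n"
    r"((?:(?!\d{4}-\d{2}-\d{2}).*(?:\n|$))*)"
)

# strip() only touches the ends of each field, so the newlines
# between message lines are preserved.
cleaned = [
    tuple(field.strip() for field in m.groups())
    for m in pattern.finditer(text2)
]

for row in cleaned:
    print(row)
```

Whether stripping is appropriate depends on whether leading or trailing whitespace in a message is meaningful for the later processing.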


Upvotes: 0

dragon2fly

Reputation: 2419

You could approach the problem by solving it for the first block, then repeating that solution until the end of the data. With this divide-and-conquer strategy, the code stays simple to understand, still solves the larger problem, and is easy to extend.

import re

data = '''2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!'''.splitlines()

first_line_patt = re.compile(r'^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) - (.*)(?= \() \((.*)\)$')


def parse_block(lines, idx):
    # parse the meta line
    res = first_line_patt.findall(lines[idx])

    # get the message
    message = []
    while idx < len(lines)-1:
        line = lines[idx + 1]
        idx += 1

        # check if next line is a meta line
        if first_line_patt.match(line):
            break

        # if not, it is a message line
        message.append(line)

    res.append('\n'.join(message))
    return res, idx


idx = 0
while True:
    res, idx = parse_block(data, idx)
    if not res[0]:
        break
    print(res)

This produces the following result:

[('2021-01-01', '10:00:05', 'Surname1 Name1', 'Comment'), 'Blablabla\nBlabla']
[('2021-01-01', '23:00:05', 'Surname2 SurnameBis Name2', 'WorkNotes'), "What?\nI don't know?"]
[('2021-01-02', '03:00:05', 'Surname1 Name1', 'Comment'), 'Blablabla!']
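
Each result above is a list holding the header tuple and the message, rather than the flat five-field tuple the question asks for. The following is a sketch of the same block-by-block idea written as a generator that yields flat tuples; `HEADER` and `iter_blocks` are hypothetical names, and the header regex is a slightly simplified variant of the one above, not the original.

```python
import re

# Header line: date, time, name, then the type in parentheses.
HEADER = re.compile(r'^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) - (.*) \((.*)\)$')


def iter_blocks(lines):
    """Yield (date, time, name, kind, message) for each block."""
    header, message = None, []
    for line in lines:
        m = HEADER.match(line)
        if m:
            # A new header closes the previous block, if there was one.
            if header is not None:
                yield (*header, '\n'.join(message))
            header, message = m.groups(), []
        elif header is not None:
            # Any other line belongs to the current message.
            message.append(line)
    # Emit the final block after the loop ends.
    if header is not None:
        yield (*header, '\n'.join(message))


data = '''2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!'''.splitlines()

blocks = list(iter_blocks(data))
for block in blocks:
    print(block)
```

Lines that appear before the first header are ignored in this sketch; that is an assumption about the input, not something the question specifies.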

Upvotes: 1

adamkwm

Reputation: 1173

My solution is almost the same as yours, but group 5 is changed from .* to \D*, so it matches everything up to the next digit (this assumes the message text itself contains no digits).

import re
text = """2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!"""
result = re.findall(r"(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})\s-\s(.*?)\((.*)\)\n(\D*)(?:\n|$)", text)
print(result)

Output:

[('2021-01-01', '10:00:05', 'Surname1 Name1 ', 'Comment', 'Blablabla\nBlabla'),
 ('2021-01-01', '23:00:05', 'Surname2 SurnameBis Name2 ', 'WorkNotes', "What?\nI don't know?"), 
 ('2021-01-02', '03:00:05', 'Surname1 Name1 ', 'Comment', 'Blablabla!')]
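
Because \D* stops at the first digit, a message that itself contains a digit gets cut short. The sketch below uses a made-up sample (`sample` and the variable names are illustrative) to compare the \D* pattern with a negative-lookahead variant that only stops at a line starting with a date-like pattern:

```python
import re

# Hypothetical sample: the second message line contains digits.
sample = """2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Meeting moved
Room 12 is free
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!"""

header = r"(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})\s-\s(.*?)\((.*)\)\n"

# Message group made of non-digit characters only.
non_digit = re.findall(header + r"(\D*)(?:\n|$)", sample)

# Message group made of whole lines that do not start with a date.
lookahead = re.findall(
    header + r"((?:(?!\d{4}-\d{2}-\d{2}).*(?:\n|$))*)", sample
)

print(repr(non_digit[0][4]))
print(repr(lookahead[0][4]))
```

With this sample, the \D* version keeps only the first message line, while the lookahead version keeps both lines (including the trailing newline).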

Upvotes: 1

balderman

Reputation: 23815

Try the code below.
The key point is the check if line[0].isdigit(), which identifies the start of a new section.

Important note: this solution removes the \n characters when it merges the sections.

data = '''2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!'''


holder = []
lines = data.split('\n')
temp = []
for line in lines:
    if line[0].isdigit():
        if temp:
            holder.append(' '.join(temp))
            temp = []
    temp.append(line)
holder.append(' '.join(temp))

for line in holder:
    print(line)

output

2021-01-01 10:00:05 - Surname1 Name1 (Comment) Blablabla Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes) What? I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment) Blablabla!
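
Since the question asks to keep the newlines, here is a hedged variant of the same idea that joins each section with '\n' instead of a space. It also uses `line[:1]` rather than `line[0]`, so an empty line inside a message does not raise an IndexError. The function name `group_sections` and the sample with a blank line are assumptions made for illustration.

```python
def group_sections(text):
    """Group lines into sections; a line starting with a digit opens a new one."""
    sections = []
    current = []
    for line in text.split('\n'):
        # line[:1] is '' for an empty line, and ''.isdigit() is False.
        if line[:1].isdigit() and current:
            sections.append('\n'.join(current))
            current = []
        current.append(line)
    if current:
        sections.append('\n'.join(current))
    return sections


# Hypothetical sample with a blank line inside the first message.
data = '''2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla

Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?'''

sections = group_sections(data)
for section in sections:
    print(repr(section))
```

As in the original, any message line that happens to start with a digit would still be treated as the start of a new section.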

Upvotes: 0

Leo

Reputation: 1303

You can parse your data like this.

import re

data = """2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!"""

def parse(data):
    text = ""
    match = None
    messages = []
    for line in data.split("\n"):
        m = re.match(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) - (.*?) \((.*?)\)$", line)
        if m:
            if match:
                msg = (match.group(1), match.group(2), match.group(3), match.group(4), text)
                messages.append(msg)
            match = m
            text = ""  # reset so the next message does not include earlier ones
        else:
            text += line + "\n"
    if match:  # flush the last message, if any header line was found
        msg = (match.group(1), match.group(2), match.group(3), match.group(4), text)
        messages.append(msg)
    return messages

for message in parse(data):
    print(message)

This outputs

('2021-01-01', '10:00:05', 'Surname1 Name1', 'Comment', 'Blablabla\nBlabla\n')
('2021-01-01', '23:00:05', 'Surname2 SurnameBis Name2', 'WorkNotes', "What?\nI don't know?\n")
('2021-01-02', '03:00:05', 'Surname1 Name1', 'Comment', 'Blablabla!\n')

Upvotes: 2
