Find all instances in text, last word should also be beginning of search with regex for python

Question

I cannot find a solution to a regex problem I have. This is a follow-up question to this post: Find string between two substrings AND between string and the end of file

I have created the following example text (in my application the text is much longer and spread over multiple files):

Course 22/09/2010 1. Early duty Josephine, Jansen 22-09-2010 10:37:08 Date 22/09/2010 Duty 1. Early duty 1.3 Here there can be some other related stuff Nursegoals Interventions Record This is now the fourth note. 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record This is a new, note (again), i call it note 3. Course 22/09/2010 1. Early duty Record This is again a note, i call it note 2. Apple: 0/less Course 22/09/2010 3. Nightduty Josephine, Jansen 22-09-2010 06:22:25 Date 22/09/2010 Course 3. Nightduty 1.3 Something else here Nursegoals Interventions Record 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record Course 22/09/2010 3. Nightduty Record This is a new note, i call it note 1.

Now I want to parse specific information from this text. I am interested in the 'Record', meaning the text that follows the word Record, and in the date for that specific record. By date I mean a date like 02-11-2010 together with the indication of early duty, late duty or night duty (so a date would be: '02-09-2010 1. Early duty'). The problem is that the files are not really consistent: sometimes there are two notes for one date and other times there is just one. Also, sometimes the note section contains text and other times it does not.

I know how to parse the Record section, but I did not know how to first parse the date and then the note section(s), so I thought I would split the problem in two. First step: split the whole file into separate date sections. Second step: iterate through all date sections to get the note(s) for that specific date section (with a regex). I would then build a sort of list containing the specific date and the note(s) related to that date. (If I wanted only the date itself, to put it in a column cell for example, I would simply parse the first 13 characters of that date section.) For example:

list = [02-08-2010 1. Early duty, [note1, note2], 02-08-2010 2. Late duty, [note1], etc]

Let's just focus on the date parsing so my problem is clear. I use the following code:

import re

# `text` holds the example text shown above
date = r'Course\s+(.*?)(?:Course|$)'
date_list = re.findall(date, text, re.DOTALL)
for i in date_list:
    print(i)
    print('XXX')

The output is:

22/09/2010 1. Early duty Josephine, Jansen 22-09-2010 10:37:08 Date 22/09/2010 Duty 1. Early duty 1.3 Here there can be some other related stuff Nursegoals Interventions Record This is now the fourth note. 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record This is a new, note (again), i call it note 3.
XXX
22/09/2010 3. Nightduty Josephine, Jansen 22-09-2010 06:22:25 Date 22/09/2010
XXX
22/09/2010 3. Nightduty Record This is a new note, i call it note 1.
XXX

This output misses the following elements:

['Course 22/09/2010 1. Early duty Record This is again a note, i call it note 2. Apple: 0/less']

and

['3. Nightduty 1.3 Something else here Nursegoals Interventions Record 6.2.1.3 Confusion: Observing. Nursegoals Interventions']

So it seems to hop over every other section. I think the regex does not treat the 'Course' that ends one match as also being the beginning of the next match.

It would be great if someone could help me :) I am probably missing something.

Wiktor Stribiżew · Accepted Answer

Change the non-capturing group to a positive lookahead:

r'Course\s+(.*?)(?=Course|$)'
                 ^^

See the regex demo. An unrolled, faster variant is r'Course\s+([^C]*(?:C(?!ourse)[^C]*)*)' (see demo).

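With the non-capturing group, the trailing Course is consumed as part of the match, so the next search starts after it and the section it introduces is skipped; in other words, the overlapping substrings do not get matched. A lookahead only checks that Course (or the end of the string) follows, without consuming it.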
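To make the difference concrete, here is a minimal sketch on a made-up three-section string (the `sample` value and the variable names are only for illustration and are not part of the original data). It runs the consuming pattern from the question, the lookahead pattern, and the unrolled variant side by side:

```python
import re

# Hypothetical miniature input: three sections, each introduced by "Course".
sample = "Course A Course B Course C"

# Pattern from the question: the closing "Course" is consumed by the match.
consuming = r"Course\s+(.*?)(?:Course|$)"
# Lookahead version: the closing "Course" is only checked, not consumed.
lookahead = r"Course\s+(.*?)(?=Course|$)"
# Unrolled variant: any text that does not contain the word "Course".
unrolled = r"Course\s+([^C]*(?:C(?!ourse)[^C]*)*)"

consumed_result = re.findall(consuming, sample, re.DOTALL)
lookahead_result = re.findall(lookahead, sample, re.DOTALL)
# No re.DOTALL needed here: a negated character class also matches newlines.
unrolled_result = re.findall(unrolled, sample)

print(consumed_result)   # ['A ', 'C']
print(lookahead_result)  # ['A ', 'B ', 'C']
print(unrolled_result)   # ['A ', 'B ', 'C']
```

The consuming pattern loses section B because its introducing "Course" was already used up as the end of the first match.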

Python demo:

import re
rx = r"Course\s+(.*?)(?=Course|$)"
s = "Course 22/09/2010 1. Early duty Josephine, Jansen 22-09-2010 10:37:08 Date 22/09/2010 Duty 1. Early duty 1.3 Here there can be some other related stuff Nursegoals Interventions Record This is now the fourth note. 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record This is a new, note (again), i call it note 3. Course 22/09/2010 1. Early duty Record This is again a note, i call it note 2. Apple: 0/less Course 22/09/2010 3. Nightduty Josephine, Jansen 22-09-2010 06:22:25 Date 22/09/2010 Course 3. Nightduty 1.3 Something else here Nursegoals Interventions Record 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record Course 22/09/2010 3. Nightduty Record This is a new note, i call it note 1."
results = re.findall(rx, s, re.DOTALL)
for x in results:
    print(x)

Output:

22/09/2010 1. Early duty Josephine, Jansen 22-09-2010 10:37:08 Date 22/09/2010 Duty 1. Early duty 1.3 Here there can be some other related stuff Nursegoals Interventions Record This is now the fourth note. 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record This is a new, note (again), i call it note 3. 
22/09/2010 1. Early duty Record This is again a note, i call it note 2. Apple: 0/less 
22/09/2010 3. Nightduty Josephine, Jansen 22-09-2010 06:22:25 Date 22/09/2010 
3. Nightduty 1.3 Something else here Nursegoals Interventions Record 6.2.1.3 Confusion: Observing. Nursegoals Interventions Record 
22/09/2010 3. Nightduty Record This is a new note, i call it note 1.
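As a possible next step toward the date/notes list described in the question, the sections can be post-processed one by one. The following is only a rough sketch under several assumptions that are not confirmed by the question: the shortened `text` is a hypothetical stand-in for the real data, a section header is assumed to be a dd/mm/yyyy date followed by one of the three duty names seen above, and everything after each "Record" keyword is naively treated as one note.

```python
import re

# Shortened, hypothetical stand-in for the full example text.
text = (
    "Course 22/09/2010 1. Early duty Record This is note 2. "
    "Course 22/09/2010 3. Nightduty Record This is note 1."
)

section_rx = re.compile(r"Course\s+(.*?)(?=Course|$)", re.DOTALL)
# Assumed header layout: dd/mm/yyyy, then a numbered duty name.
header_rx = re.compile(
    r"(\d{2}/\d{2}/\d{4})\s+(\d\.\s+(?:Early duty|Late duty|Nightduty))"
)

parsed = []
for section in section_rx.findall(text):
    header = header_rx.match(section)
    if header is None:
        # Section does not start with a date (e.g. "3. Nightduty 1.3 ..."): skip it.
        continue
    date, duty = header.groups()
    # Naive assumption: the text after each "Record" keyword is one note.
    pieces = section[header.end():].split("Record")[1:]
    notes = [piece.strip() for piece in pieces if piece.strip()]
    parsed.append((date, duty, notes))

print(parsed)
```

On the full text the notes would still carry trailing material such as "6.2.1.3 Confusion: Observing. Nursegoals Interventions", so a stricter note pattern would be needed there.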
