How do I search with regex and avoid entries from a list?

Question

I have a long list of entries in a file in the following format:

<number> = "<word/phrase/sentence>"

e.g.

12345 = "Section 3 is ready for review"

24680 = "Bob to review Chapter 4"

I need to find a way of inserting additional text at the beginning of the word/phrase/sentence, but only if it doesn't start with one of several key words.

Additional text: 'Complete: '

List of key words: key_words_list = ['Section', 'Page', 'Heading']

e.g.

12345 = "Section 3 is ready for review" (no changes needed - sentence starts with 'Section' which is in the list)

24680 = "Complete: Bob to review Chapter 4" ('Complete: ' added to start of sentence because first word wasn't in list)

This could be done with a lot of string splitting and if statements, but regex seems like it should be a more concise and much neater solution. I have the following, which doesn't take account of the list:

for line in lines:
    line = re.sub(r'(^\s\s[0-9]+\s=\s")', r'\1Complete: ', line)

I also have some code that manages to identify the lines that require changes:

print([w for w in re.findall(r'^\s\s[0-9]+\s=\s"([\w+=?\s?,?.?]+)"', line) if w not in key_words_list])

Is regex the best option for what I need and if so, what am I missing?

Example inputs:

12345 = "Section 3 is ready for review"

24680 = "Bob to review Chapter 4"

Example outputs:

12345 = "Section 3 is ready for review"

24680 = "Complete: Bob to review Chapter 4"

Wiktor Stribiżew · Accepted Answer

You can use a regex like

^\s{2}[0-9]+\s=\s"(?!(?:Section|Page|Heading)\b)

See the regex demo. Details:

^ - start of string
\s{2} - two whitespace characters
[0-9]+ - one or more digits
\s=\s - a = char with a single whitespace character on each side
" - a " char
(?!(?:Section|Page|Heading)\b) - a negative lookahead that fails the match if Section, Page or Heading appears as a whole word immediately to the right of the current location.
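
To see what the lookahead and the \b word boundary do in isolation, here is a minimal sketch. The sample strings are made up for illustration and are not from the question; the pattern is the one described above.

```python
import re

# Same pattern as above, with the key words written out literally.
pattern = re.compile(r'^\s{2}[0-9]+\s=\s"(?!(?:Section|Page|Heading)\b)')

# Hypothetical sample lines, chosen to exercise the lookahead.
samples = [
    '  1 = "Section 3 is ready"',     # starts with a key word: no match
    '  2 = "Sections need merging"',  # "Sections" is a different word: match
    '  3 = "Bob to review"',          # no key word at the start: match
]

results = [bool(pattern.search(s)) for s in samples]
print(results)  # [False, True, True]
```

Without the \b, the second line would also be rejected, because "Sections" begins with "Section".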

See the Python demo:

import re
texts = ['  12345 = "Section 3 is ready for review"', '  24680 = "Bob to review Chapter 4"']
add = 'Complete: '
key_words_list = ['Section', 'Page', 'Heading']
pattern = re.compile(fr'^\s{{2}}[0-9]+\s=\s"(?!(?:{"|".join(key_words_list)})\b)')
for text in texts:
    print(pattern.sub(fr'\g<0>{add}', text))

# =>   12345 = "Section 3 is ready for review"
#      24680 = "Complete: Bob to review Chapter 4"
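
If the key words might ever contain regex metacharacters (a dot, a plus sign, parentheses), it is probably safer to escape them before joining. Below is a sketch of the same approach wrapped in a helper. The name add_prefix and its parameters are my own, not part of the question, and it assumes every key word ends in a word character so that \b still behaves as expected.

```python
import re

def add_prefix(lines, key_words, prefix='Complete: '):
    """Return a copy of lines with prefix inserted after the opening quote,
    unless the quoted text starts with one of key_words (hypothetical helper)."""
    # re.escape keeps characters such as '.' or '+' in a key word literal.
    alternation = '|'.join(map(re.escape, key_words))
    pattern = re.compile(rf'^\s{{2}}[0-9]+\s=\s"(?!(?:{alternation})\b)')
    # A function replacement inserts prefix as-is, so backslashes in it
    # are not interpreted as group references.
    return [pattern.sub(lambda m: m.group(0) + prefix, line) for line in lines]

lines = [
    '  12345 = "Section 3 is ready for review"',
    '  24680 = "Bob to review Chapter 4"',
]
result = add_prefix(lines, ['Section', 'Page', 'Heading'])
for line in result:
    print(line)
```

For the three key words in the question this behaves the same as the code above; escaping only makes a difference once the list contains something like 'Fig. 1', where an unescaped dot would match any character.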
