Searching for substrings in a string that match a certain condition

Question

This is sort of a continuation of my other post: Extracting numbers from a string under certain conditions

To summarize, I have some strings that are stored in a dataframe and I want to extract the first number that matches all the conditions (if it exists). Here are the conditions:

The number CANNOT be at the start of the string
It CANNOT appear after the word "No. " or after the word "Question"
The number CANNOT be between the values 1960 - 2020
If the number is immediately followed by the letter e, I want to extract the e with it

This is what I have so far to find the numbers; it takes care of the first two conditions:

for index, row in df.iterrows():
    test = re.search(r'(?!^)(?



I've also tried using:

\b(?!196[0-9]\d|20[012][0])\d+\b


to account for the number not being between 1960 and 2020, but it doesn't seem to work. I also don't understand how to capture the e if it's there.

Example 1: 

"Trial No. 32819 Question 485 Article 787e"


I would want the regex expression to return

[787e]


Example 2:

"2981 XYZ Legislature"


I would want the regex expression to return

None


Example 3:

"Addendum217Null"


I would want the regex expression to return

[217]


Thanks in advance for any help!

Wiktor Stribiżew · Accepted Answer

You may use

(?!^)(?



See the regex demo

The new part is (?:



(?<!\d) - a negative lookbehind: no digit allowed immediately to the left of the current location

(?!(?:19[6-9][0-9]|20[01][0-9]|2020)(?!\d)) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a number from 1960 to 2020 not followed with a digit
(\d+(?!\d)e?) - Group 1 (what you will get extracted): 1+ digits that are not followed with a digit and an optional e letter
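To try the pieces above against the three examples from the question, here is a minimal Python sketch. The full pattern is cut off in this post, so the two lookbehinds for "No. " and "Question " are my assumption about what it contained; the year-range lookahead and Group 1 are the parts explained above. The helper name `first_number` is made up for illustration.

```python
import re

# Assumed full pattern: the two lookbehinds for "No. " and "Question " are a
# guess at the elided part; the remaining pieces are the ones explained above.
PATTERN = re.compile(
    r"(?!^)"                 # not at the very start of the string
    r"(?<!\bNo\. )"          # assumption: not right after "No. "
    r"(?<!\bQuestion )"      # assumption: not right after "Question "
    r"(?<!\d)"               # no digit immediately to the left
    r"(?!(?:19[6-9][0-9]|20[01][0-9]|2020)(?!\d))"  # not a number in 1960-2020
    r"(\d+(?!\d)e?)"         # Group 1: digits plus an optional trailing e
)


def first_number(text):
    """Return the first qualifying number as a string, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


samples = [
    "Trial No. 32819 Question 485 Article 787e",
    "2981 XYZ Legislature",
    "Addendum217Null",
]
for s in samples:
    print(repr(s), "->", first_number(s))
```

With this assumed pattern the three samples give 787e, None and 217. Python's `re` only accepts fixed-width lookbehinds, which is why "No. " and "Question " are checked in two separate lookbehinds instead of one alternation. For a dataframe column, something like `df["col"].str.extract(PATTERN.pattern, expand=False)` should avoid the `iterrows()` loop (the column name here is hypothetical).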
