DataScience99

Reputation: 369

Searching for substrings in a string that match a certain condition

This is sort of a continuation of my other post: Extracting numbers from a string under certain conditions

To summarize, I have some strings that are stored in a dataframe and I want to extract the first number that matches all the conditions (if it exists). Here are the conditions:

This is what I have so far to find the numbers; it takes care of the first two conditions:

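import re

# For each row, search the name for the first number that is not at the start
# of the string and is not preceded by "No. " or "Question ".
for index, row in df.iterrows():
    test = re.search(r'(?!^)(?<!\bNo\.\s)(?<!\bQuestion\s)(\d+)(?!\d)',
                     row['name'])
    if test:
        # Store the matched number in the id column of every row with this name.
        df.loc[
            df['name'] == row['name'], ['id']] = test.group()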

I've also tried using:

\b(?!196[0-9]\d|20[012][0])\d+\b

to account for the number not being between 1960 and 2020, but it doesn't seem to work. I also don't understand how to catch the trailing e if it's there.

Example 1:

"Trial No. 32819 Question 485 Article 787e"

I would want the regex to return

[787e]

Example 2:

"2981 XYZ Legislature"

I would want the regex to return

None

Example 3:

"Addendum217Null"

I would want the regex to return

[217]

Thanks in advance for any help!

Upvotes: 1

Views: 367

Answers (1)

Wiktor Stribiżew

Reputation: 626738

You may use

(?!^)(?<!\bNo\.\s)(?<!\bQuestion\s)(?<!\d)(?!(?:19[6-9][0-9]|20[01][0-9]|2020)(?!\d))(\d+(?!\d)e?)

See the regex demo

The new part is (?<!\d)(?!(?:19[6-9][0-9]|20[01][0-9]|2020)(?!\d))(\d+(?!\d)e?):

  • (?<!\d) - no digit is allowed immediately to the left of the current location, so a match cannot start in the middle of a number
  • (?!(?:19[6-9][0-9]|20[01][0-9]|2020)(?!\d)) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a number from 1960 to 2020 that is not followed by a digit
  • (\d+(?!\d)e?) - Group 1 (what you will get extracted): one or more digits that are not followed by a digit, and then an optional letter e
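
As a quick check, here is a minimal sketch that runs the pattern above against the three examples from the question using only Python's re module. The helper name extract_id and the examples list are made up for illustration; they are not part of the original code.

```python
import re

# Pattern copied from the answer above, split across lines for readability.
PATTERN = re.compile(
    r'(?!^)(?<!\bNo\.\s)(?<!\bQuestion\s)(?<!\d)'    # position checks
    r'(?!(?:19[6-9][0-9]|20[01][0-9]|2020)(?!\d))'   # reject 1960 to 2020
    r'(\d+(?!\d)e?)'                                 # Group 1: digits + optional e
)


def extract_id(text):
    """Hypothetical helper: return the first qualifying number, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


examples = [
    "Trial No. 32819 Question 485 Article 787e",
    "2981 XYZ Legislature",
    "Addendum217Null",
]

for text in examples:
    print(repr(text), '->', extract_id(text))
# 'Trial No. 32819 Question 485 Article 787e' -> 787e
# '2981 XYZ Legislature' -> None
# 'Addendum217Null' -> 217
```

Since the strings live in a dataframe, the same pattern could probably be applied to the whole column at once with df['name'].str.extract(PATTERN.pattern, expand=False) instead of looping with iterrows; that assumes the column is called name, as in the question.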

Upvotes: 5
