Reputation: 369
This is sort of a continuation of my other post: Extracting numbers from a string under certain conditions
To summarize, I have some strings that are stored in a dataframe and I want to extract the first number that matches all the conditions (if it exists). Here are the conditions:
The number CANNOT be at the start of the string
It CANNOT appear after the word "No. " or after the word "Question"
The number CANNOT be between the values 1960 - 2020
If the number is immediately followed by the letter e, I want to extract the e with it
This is what I have so far to find the numbers; it takes care of the first two conditions:
import re

for index, row in df.iterrows():
    test = re.search(r'(?!^)(?<!\bNo\.\s)(?<!\bQuestion\s)(\d+)(?!\d)',
                     row['name'])
    if test:
        df.loc[df['name'] == row['name'], ['id']] = test.group()
I've also tried using:
\b(?!196[0-9]\d|20[012][0])\d+\b
to account for the number not being between 1960 and 2020, but it doesn't seem to work. I also don't understand how to catch the e if it's there.
Example 1:
"Trial No. 32819 Question 485 Article 787e"
I would want the regex expression to return
[787e]
Example 2:
"2981 XYZ Legislature"
I would want the regex expression to return
None
Example 3:
"Addendum217Null"
I would want the regex expression to return
[217]
Thanks in advance for any help!
Upvotes: 1
Views: 367
Reputation: 626738
You may use
(?!^)(?<!\bNo\.\s)(?<!\bQuestion\s)(?<!\d)(?!(?:19[6-9][0-9]|20[01][0-9]|2020)(?!\d))(\d+(?!\d)e?)
See the regex demo
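As a quick check, here is a minimal Python sketch that runs the pattern above against the three examples from the question. The helper name `first_number` is made up for illustration and is not part of the original code.

```python
import re

# The pattern from the answer, split across lines for readability.
PATTERN = re.compile(
    r'(?!^)(?<!\bNo\.\s)(?<!\bQuestion\s)(?<!\d)'
    r'(?!(?:19[6-9][0-9]|20[01][0-9]|2020)(?!\d))'
    r'(\d+(?!\d)e?)'
)

def first_number(text):
    """Return the first number that meets all conditions, or None (hypothetical helper)."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

for sample in ("Trial No. 32819 Question 485 Article 787e",
               "2981 XYZ Legislature",
               "Addendum217Null"):
    print(repr(sample), "->", first_number(sample))
```

If the strings live in a dataframe column, something like `df['name'].str.extract(PATTERN.pattern, expand=False)` should give the same result without the explicit loop, assuming the column holds plain strings.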
The new part is (?<!\d)(?!(?:19[6-9][0-9]|20[01][0-9]|2020)(?!\d))(\d+(?!\d)e?):
(?<!\d) - no digit allowed immediately to the left of the current location
(?!(?:19[6-9][0-9]|20[01][0-9]|2020)(?!\d)) - a negative lookahead that fails the match if, immediately to the right of the current location, there is a number from 1960 to 2020 not followed with a digit
(\d+(?!\d)e?) - Group 1 (what you will get extracted): 1+ digits that are not followed with a digit and an optional e letter
Upvotes: 5
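To see how the year lookahead behaves at the edges of the range, here is a small sketch that tries a few made-up inputs. The sample strings and the helper name `extract_id` are assumptions for illustration only. Note that a five-digit number such as 19601 is still extracted, because the inner (?!\d) only rejects a year when no further digit follows it.

```python
import re

# Same pattern as in the answer above.
PATTERN = re.compile(
    r'(?!^)(?<!\bNo\.\s)(?<!\bQuestion\s)(?<!\d)'
    r'(?!(?:19[6-9][0-9]|20[01][0-9]|2020)(?!\d))'
    r'(\d+(?!\d)e?)'
)

def extract_id(text):
    """Return group 1 of the first match, or None (hypothetical helper)."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

# Made-up inputs probing the boundaries of the excluded 1960-2020 range.
samples = [
    "Report 1959",        # just below the range, so it is extracted
    "Report 1960",        # lower bound, rejected
    "Report 2020",        # upper bound, rejected
    "Report 2021",        # just above the range, so it is extracted
    "Report 19601",       # five digits, not a year in the range
    "Year 1999 item 42",  # the year is skipped and the next number is taken
]

for sample in samples:
    print(repr(sample), "->", extract_id(sample))
```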