Giovanni S

Reputation: 2110

Regex remove numerics between two strings

I am trying to retrieve a string with regex, ignoring all the numbers that sit between two other strings.

In my example below, ABC, DEF and HIJ will always be the same value, and in the same position of the string being searched, but 123 and 456 will always be different values and lengths.

My example string is:

"ABC 123 456 DEF HIJ"

I am trying to be left with the result of:

"DEF"

I can do this in two steps, first by using

r'ABC (.*) HIJ'

which leaves me with 123 456 DEF, and then keeping only the characters matched by

r'[^0-9\s]'

It seems like this should be possible with a single regex, but I can't figure out how, or whether it is possible at all.
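For reference, the two-step approach described above might look roughly like the sketch below. The helper name `two_step` is made up for illustration, and joining the matches of the second pattern is only one possible reading of "where I could then".

```python
import re

def two_step(text):
    # Step 1: grab everything between the fixed markers ABC and HIJ.
    match = re.search(r'ABC (.*) HIJ', text)
    if match is None:
        return None
    middle = match.group(1)  # e.g. '123 456 DEF'
    # Step 2: keep only the characters that are neither digits nor whitespace.
    return ''.join(re.findall(r'[^0-9\s]', middle))

print(two_step("ABC 123 456 DEF HIJ"))
```

Note that step 2 also drops any spaces inside the remaining text, so a multi-word middle would be squashed together.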

Upvotes: 0

Views: 149

Answers (4)

John Machin

Reputation: 82934

Based on "ABC, DEF and HIJ will always be the same value, and in the same position of the string being searched, but 123 and 456 will always be different values and lengths":

>>> import re
>>> re.sub(r"ABC \d+ \d+ DEF HIJ", "DEF", "foo1 ABC 12345 67890 DEF HIJ foo2")
'foo1 DEF foo2'

Upvotes: 0

Alwyn Schoeman

Reputation: 466

How about the regular expression: (Updated due to the first comment)

([A-Za-z]+)\ [A-Za-z]+$

It captures the first of the two space-separated words at the end of the line.

import re

s = "ABC 123123123 1231231234 DEF HIJ"
pat = r'([A-Za-z]+)\ [A-Za-z]+$'
a = re.findall(pat,s)
print(a)

This prints ['DEF'], since findall returns a list of the captured groups.

To capture multiple words in that position you could modify the pattern to:

r'\ ([A-Za-z\ ]+)\ [A-Za-z]+$'

For an input of ABC 234234 46456456 DEF ZYX HIJ, this captures 'DEF ZYX'.
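A quick way to compare the two patterns is to run both against both inputs. This is only a sketch; the names `SINGLE`, `MULTI` and `capture` are made up for illustration, and `re.search` is used instead of `findall` so that a plain string comes back.

```python
import re

SINGLE = r'([A-Za-z]+)\ [A-Za-z]+$'       # one word before the last word
MULTI = r'\ ([A-Za-z\ ]+)\ [A-Za-z]+$'    # one or more words before the last word

def capture(pattern, text):
    # Return the first captured group, or None if the pattern does not match.
    match = re.search(pattern, text)
    return match.group(1) if match else None

for text in ("ABC 123123123 1231231234 DEF HIJ",
             "ABC 234234 46456456 DEF ZYX HIJ"):
    print(repr(capture(SINGLE, text)), repr(capture(MULTI, text)))
```

On the three-word input the single-word pattern only picks up the word directly before the last one, which is why the second pattern is needed there.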

If you want to enforce that the first string must be ABC and the last one HIJ, then the other answer by wim will do the trick.

Upvotes: 0

clarkema

Reputation: 41

Depending on exactly what is fixed in your input data, you could try extracting the second "word", allowing for (and ignoring) intervening strings of digits with a pattern like this:

import re

foo = "ABC 123 456 DEF 456 HIJ"
pat = r'\w+\s+[\d ]*(\w+)[\d ]*\w+'
print(re.findall(pat, foo))  # ['DEF']

Alternatively, regexps might not be the easiest way. You could use a single regexp to strip out all numeric characters, split the remaining string on whitespace, and take the second element.
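A minimal sketch of that alternative, assuming the wanted word is always the second one left after the digits are removed (the function name `second_word` is made up for illustration):

```python
import re

def second_word(text):
    # Strip every run of digits, then split what is left on whitespace.
    words = re.sub(r'\d+', '', text).split()
    # Index 1 is the second remaining word; None if there are too few words.
    return words[1] if len(words) > 1 else None

print(second_word("ABC 123 456 DEF HIJ"))
```

Because `split()` with no arguments collapses consecutive whitespace, the gaps left by the removed digits do not produce empty strings.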

Upvotes: 4

wim

Reputation: 362756

In a regex, \d+ matches one or more digits (greedily), so each number is consumed whatever its length.

>>> import re
>>> s = "ABC 123 456 DEF HIJ"
>>> pat = r'ABC \d+ \d+ (.*) HIJ'
>>> re.findall(pat, s)
['DEF']
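If a plain string is wanted rather than a one-element list, the same pattern could be used with re.search instead. This is just a sketch; the function name `between_markers` is made up for illustration.

```python
import re

# Same pattern as above: two runs of digits after ABC, then the wanted text, then HIJ.
PAT = re.compile(r'ABC \d+ \d+ (.*) HIJ')

def between_markers(text):
    # Return the captured text, or None when the input does not fit the pattern.
    match = PAT.search(text)
    return match.group(1) if match else None

print(between_markers("ABC 123 456 DEF HIJ"))
```

Returning None for non-matching input avoids the AttributeError that calling .group() on a failed search would raise.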

Upvotes: 4
