Reputation: 4644
Below is an example of a test case:
inpoot = "A.p.p.l.e (45) Orange (5.11) Kiwi" # WE HAVE
outpoot = "A.p.p.l.e () Orange () Kiwi" # WE WANT
The only reason I misspelled inpoot is that input is the name of a Python built-in function (it is not actually a reserved keyword), and I did not want to shadow it.
One might think that the following would work:
import string

def kill_numbers(text: str) -> str:
    text = str(text)
    return "".join(filter(lambda ch: ch not in string.digits, text))
However, the decimal point (.) in a decimal number will be preserved.
inpoot = "A.p.p.l.e (45) Orange T5.11T Kiwi 99 Apricot"
outpoot = kill_numbers(inpoot)
print(repr(outpoot))
# prints 'A.p.p.l.e () Orange T.T Kiwi  Apricot'
# We want `TT` not `T.T`
# the output contains a stray decimal point.
outpoot = kill_numbers("Strawberry 3.145 Plum")
print(repr(outpoot))
# fails to delete the `.` in `3.145`
INPUT | BAD OUTPUT | DESIRED OUTPUT |
---|---|---|
"3.14" | "." | "" (empty string) |
So, how can we delete all numbers, including decimal numbers?
A substitution using regular expressions is theoretically possible.
import re
test_case = "(.4) A.p.p.l.e (44) Orange .... (4.44) Kiwi . . . . ."
result = re.sub(r"[0-9]+\.?[0-9]*|\.[0-9]+", "", test_case)
print(result) # () A.p.p.l.e () Orange .... () Kiwi . . . . .
The regular expression shown above works for that one test case, but not all test cases.
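To make the failure concrete, here is a minimal sketch that runs the same pattern on two other inputs that appear in this question. The helper name delete_matches is only for illustration.

```python
import re

# The pattern from the snippet above, written as a raw string.
PATTERN = re.compile(r"[0-9]+\.?[0-9]*|\.[0-9]+")

def delete_matches(text: str) -> str:
    """Delete every substring that PATTERN matches (illustrative helper)."""
    return PATTERN.sub("", text)

# The sentence-ending period is swallowed together with the number.
print(repr(delete_matches("The number of houses was 44.")))
# prints 'The number of houses was '

# "3.10.4" is consumed as "3.10" followed by ".4", so nothing is left.
print(repr(delete_matches("3.10.4")))
# prints ''
```

Both results are undesirable here: the period after 44 should survive, and 3.10.4 should not be treated as a number at all.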
The table below shows how various regular expressions perform on various test inputs.
KEY FOR TABLE

"-" means that the regex matches the entire string
"+" means that the regex does NOT match the string
"meh" means that the regex matches a small part of the string, but not the whole thing.

REGEX | ' 1 ' | '2' | '3' | '365' | '9.43' | '-5000' | '+10' | '3.10.4' | '0001' | '.5' | '.' | '591.' | '' | '0x77F' | '3.456e11' |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
[0-9]+\.?[0-9]*\|\.[0-9]+ | - | - | - | - | - | meh | meh | meh | - | - | + | - | + | meh | meh |
[+-]?[0-9]+\.?[0-9]*\|\.[0-9]+ | - | - | - | - | - | - | - | meh | - | - | + | - | + | meh | meh |
[+-]?([0-9]+\.?[0-9]*\|\.[0-9]+) | - | - | - | - | - | - | - | meh | - | - | + | - | + | meh | meh |
[0-9]*\.?[0-9]* | meh | - | - | - | - | meh | meh | meh | - | - | - | - | - | meh | meh |
[0-9]+\.?[0-9]+ | + | + | + | - | - | meh | meh | meh | - | + | + | meh | + | meh | meh |
[0-9]+\.?[0-9]* | - | - | - | - | - | meh | meh | meh | - | meh | + | - | + | meh | meh |
[0-9]*\.?[0-9]+ | - | - | - | - | - | meh | meh | meh | - | - | + | meh | + | meh | meh |
\d+ | - | - | - | - | meh | meh | meh | meh | - | meh | + | meh | + | meh | meh |
[0-9] | - | - | - | meh | meh | meh | meh | meh | meh | meh | + | meh | + | meh | meh |
\d | - | - | - | meh | meh | meh | meh | meh | meh | meh | + | meh | + | meh | meh |
\d* | meh | - | - | - | meh | meh | meh | meh | - | meh | meh | meh | - | meh | meh |
The same table in ASCII form might be easier to read and understand:
                                 ' 1 '  '2'  '3'  '365'  '9.43'  '-5000'  '+10'  '3.10.4'  '0001'  '.5'  '.'  '591.'  ''   '0x77F'  '3.456e11'
[0-9]+\.?[0-9]*|\.[0-9]+         -      -    -    -      -       meh      meh    meh       -       -     +    -       +    meh      meh
[+-]?[0-9]+\.?[0-9]*|\.[0-9]+    -      -    -    -      -       -        -      meh       -       -     +    -       +    meh      meh
[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)  -      -    -    -      -       -        -      meh       -       -     +    -       +    meh      meh
[0-9]*\.?[0-9]*                  meh    -    -    -      -       meh      meh    meh       -       -     -    -       -    meh      meh
[0-9]+\.?[0-9]+                  +      +    +    -      -       meh      meh    meh       -       +     +    meh     +    meh      meh
[0-9]+\.?[0-9]*                  -      -    -    -      -       meh      meh    meh       -       meh   +    -       +    meh      meh
[0-9]*\.?[0-9]+                  -      -    -    -      -       meh      meh    meh       -       -     +    meh     +    meh      meh
\d+                              -      -    -    -      meh     meh      meh    meh       -       meh   +    meh     +    meh      meh
[0-9]                            -      -    -    meh    meh     meh      meh    meh       meh     meh   +    meh     +    meh      meh
\d                               -      -    -    meh    meh     meh      meh    meh       meh     meh   +    meh     +    meh      meh
\d*                              meh    -    -    -      meh     meh      meh    meh       -       meh   meh  meh     -    meh      meh
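A table like this can be generated mechanically. The sketch below is one possible way to do it, not necessarily how the table above was produced; the function name classify and its three labels are made up for illustration. It assumes that a full match means re.fullmatch succeeds and that a partial match means re.search finds something, including an empty match.

```python
import re

def classify(pattern: str, text: str) -> str:
    """Return 'full', 'partial', or 'none' for how pattern matches text."""
    if re.fullmatch(pattern, text):
        return "full"      # the regex matches the entire string
    if re.search(pattern, text):
        return "partial"   # the regex matches only part of the string
    return "none"          # the regex does not match anywhere

print(classify(r"\d+", "365"))   # full
print(classify(r"\d+", "9.43"))  # partial
print(classify(r"\d+", "."))     # none
print(classify(r"\d*", "."))     # partial (an empty match still counts)
```

Note that a pattern such as \d* can match the empty string, so under this assumption it never yields 'none'.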
In my humble opinion, regular expressions are a nightmare.
To digress, it took me a long time to realize that IMHO = "In my humble opinion". I don't speak acronym very well.
Back to business...
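I cannot find a regex which satisfies the following requirements:

- It does not match the empty string ("").
- It does not match "3.10.4". At most one decimal point is allowed to appear in what we call a "number".
- It does not match a lone decimal point (".").

Desired behavior is as follows: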
PSEUDO-NUMBER | IS_A_NUMBER() | NOTES |
---|---|---|
"1" | Yes | int |
"2" | Yes | int |
"365" | Yes | int |
"365." | No | 365. is a float equivalent to 365.0. However, I do not want to delete the (.) at the end of the string "The number of houses was 44." |
"9.43" | Yes | one decimal point |
"-5000" | Yes | |
"+10" | Yes | |
"0001" | Yes | |
".5" | Yes | .5 is equivalent to 0.5 |
"1" | Yes | |
"0x77F" | Yes | |
"3.456e11" | Yes | pseudo-scientific-notation |
"3.10.4" | Not a number | two decimal points |
"." | Not a number | |
"" | Not a number | do not match the empty string |
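The desired behavior can be turned into a small regression check. The sketch below is only an illustration: the names EXPECTED and mismatches are hypothetical, and it assumes that "is a number" means the whole string matches under re.fullmatch. It is run here against the first pattern tried earlier in this question.

```python
import re

# Expected verdicts copied from the table: True means "is a number".
EXPECTED = {
    "1": True, "2": True, "365": True, "365.": False, "9.43": True,
    "-5000": True, "+10": True, "0001": True, ".5": True,
    "0x77F": True, "3.456e11": True,
    "3.10.4": False, ".": False, "": False,
}

def mismatches(pattern: str) -> list[str]:
    """Return the entries on which re.fullmatch disagrees with EXPECTED."""
    compiled = re.compile(pattern)
    return [s for s, want in EXPECTED.items()
            if (compiled.fullmatch(s) is not None) != want]

print(mismatches(r"[0-9]+\.?[0-9]*|\.[0-9]+"))
# prints ['365.', '-5000', '+10', '0x77F', '3.456e11']
```

A candidate pattern satisfies the table only when the returned list is empty.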
The following are defined to be seed numbers: (1, 365, 9.43, -5000, +10, 0001, .5, .5, 0x77F, 3.456e11)
A valid number is defined to be any seed number, or a string formed from a seed number by doing one of the following:

- replacing a decimal digit with one or more decimal digits, such as 99
- replacing the F in 0xF with 2F or F2, or with A, B, C, D, or E

For example, you could replace the 5 in -5000 with 9 to get -9000. Also, you could replace the 5 in .5 with 99 to get .99.
The above defines language L.
My question could be re-worded as follows:
What algorithm A will return s′ from input string s such that:
A substring t of string s is maximal and t is in language L if it is not possible to tack on one more character to the left or to the right of t to form t′, such that t′ is a string in language L and t′ is a substring of s.
In layman's terms, if you see "apple 12.345" you should go after "12.345" not "2.34".
Indices matter. Sometimes it makes no sense to say that the letter "a" is a sub-string of "abracadabra". Which letter "a" is it? Is it the letter "a" third from the left, or second from the left?
We define a string to be a mathematical mapping M from a finite subset of the natural numbers to the ASCII character set such that the difference between the maximum and the minimum of the domain of M is one less than the cardinality of the domain of M (that is, the domain is a contiguous range of indices).
For any string SML and any string LRG, we say that SML is a sub-string of LRG if and only if SML[k] = LRG[k] for all k in the domain of string SML.
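In Python, this index-aware notion of a sub-string corresponds roughly to the (start, end) span that a match object reports. The sketch below is only an illustration; the numeric pattern in it is a deliberately simplified stand-in and does not implement language L.

```python
import re

# Every occurrence of "a" in "abracadabra", identified by its start index.
positions = [m.start() for m in re.finditer("a", "abracadabra")]
print(positions)  # [0, 3, 5, 7, 10]

# A greedy pattern reports the longest run starting at the leftmost position,
# together with its (start, end) indices in the larger string.
# Simplified stand-in pattern: digits with an optional fractional part.
m = re.search(r"\d+(?:\.\d+)?", "apple 12.345")
print(m.span(), m.group())  # (6, 12) 12.345
```

Here the match is "12.345" at indices 6 through 11, not the shorter "2.34" that sits inside it.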
Upvotes: 0
Views: 166
Reputation: 106553
You can use negative lookarounds to avoid undesired corner cases. Use alternation patterns to include incompatible patterns such as hexadecimal numbers:
[+-]?(?:(?:\b(?<!\d\.)\d+(?:\.\d+)?|(?<!\d)\.\d+)(?!\.)(?:e\d+)?|\b0x[0-9A-F]+)\b
Demo: https://regex101.com/r/HXxct5/2
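As a rough sketch of how this pattern could be plugged into the question's kill_numbers function (the wiring below is an assumption, only the pattern itself comes from this answer):

```python
import re

# The pattern from this answer, copied verbatim into a raw string.
NUMBER = re.compile(
    r"[+-]?(?:(?:\b(?<!\d\.)\d+(?:\.\d+)?|(?<!\d)\.\d+)(?!\.)(?:e\d+)?|\b0x[0-9A-F]+)\b"
)

def kill_numbers(text: str) -> str:
    """Delete every substring matched by NUMBER."""
    return NUMBER.sub("", text)

print(repr(kill_numbers("A.p.p.l.e (45) Orange (5.11) Kiwi")))
# prints 'A.p.p.l.e () Orange () Kiwi'
print(repr(kill_numbers("Strawberry 3.145 Plum")))
# prints 'Strawberry  Plum'
print(repr(kill_numbers("3.10.4")))
# prints '3.10.4'
```

The lookarounds are what keep 3.10.4 intact: each digit run there is either followed by a period or preceded by a digit and a period, so no alternative can match.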
Upvotes: 2
Reputation: 61
>>> import re
>>> inpoot = "A.p.p.l.e (45) Orange (5.11) Kiwi" # WE HAVE
>>> pattern = re.compile(r"\d+\.?\d*")
>>> re.sub(pattern, "", inpoot)
'A.p.p.l.e () Orange () Kiwi'
>>>
Upvotes: 0