Reputation: 2619
How do we remove all non-numeric characters from a string in Python?
Upvotes: 256
Views: 324022
Reputation: 131
In addition to Mark's answer, if you need to convert multiple strings, you can create a lambda:
make_num = lambda s: ''.join([c for c in s if c.isdigit()])
string_a = make_num(string_a)
string_b = make_num(string_b)
Upvotes: 0
Reputation: 376012
>>> import re
>>> re.sub("[^0-9]", "", "sdkjh987978asd098as0980a98sd")
'987978098098098'
>>> # or
>>> re.sub(r"\D", "", "sdkjh987978asd098as0980a98sd")
'987978098098098'
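One caveat worth noting: on Python 3 `str` patterns the two expressions above are not strictly equivalent, because `\D` follows Unicode's idea of a decimal digit while `[^0-9]` is ASCII only. A small sketch of the difference (the sample string is made up for illustration):

```python
import re

# Sample input: ASCII '1', SUPERSCRIPT TWO (U+00B2), ARABIC-INDIC DIGIT THREE (U+0663)
s = "a1\u00b2\u0663b"

# Explicit ASCII range: only 0-9 survive.
ascii_only = re.sub(r"[^0-9]", "", s)

# \D on a str pattern keeps every Unicode decimal digit (category Nd).
unicode_decimal = re.sub(r"\D", "", s)

# re.ASCII restricts \d / \D to the ASCII digits again.
ascii_flag = re.sub(r"\D", "", s, flags=re.ASCII)

# str.isdigit is broader still: it also accepts superscript digits.
isdigit_based = "".join(c for c in s if c.isdigit())

print(ascii_only, unicode_decimal, ascii_flag, isdigit_based)
```

If the result is going to be passed to something that expects plain 0-9, the explicit `[^0-9]` class (or `re.ASCII`) is probably the safer choice.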
Upvotes: 424
Reputation: 258478
Not sure if this is the most efficient way, but:
>>> ''.join(c for c in "abc123def456" if c.isdigit())
'123456'
The ''.join part means to combine all the resulting characters together without any characters in between. Then the rest of it is a generator expression, where (as you can probably guess) we only take the parts of the string that match the condition isdigit.
Upvotes: 151
Reputation: 1856
There are a lot of correct answers here. Some are faster or slower than others. The approach used in Ehsan Akbaritabar's and tzot's answers, filter with str.isdigit, is really fast; as is translate, from Alex Martelli's answer, once the setup is done. These are the two fastest methods. However, if you are only doing the substitution once, the setup penalty for translate is significant.
Which way is the best may depend on your use case. One replacement in a unit test? I'd go for filter using isdigit. It requires no imports, uses builtins only, and is quick and easy:
''.join(filter(str.isdigit, string_to_filter))
In a pandas or pyspark DataFrame, with millions of rows, the efficiency of translate is probably worth it, if you don't use the methods the DataFrame provides (which tend to rely on regex).
If you want to take the translate approach, I'd recommend some changes for Python 3:
import string
unicode_non_digits = dict.fromkeys(
    [x for x in range(65536) if chr(x) not in string.digits]
)
string_to_filter.translate(unicode_non_digits)
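Note that a table built from range(65536) only covers the Basic Multilingual Plane, so characters above U+FFFF (most emoji, for instance) would pass through untouched. One possible way around that, sketched here under the assumption that a dict subclass is acceptable, is to let `__missing__` delete everything that isn't listed. `DigitsOnlyTable` and `keep_digits` are hypothetical names, not something from the answers in this thread:

```python
import string


class DigitsOnlyTable(dict):
    """Hypothetical translation table: any code point not stored is deleted."""

    def __missing__(self, code_point):
        # str.translate removes a character when the table maps it to None.
        return None


# Map each ASCII digit's code point to itself so digits pass through unchanged.
digits_only = DigitsOnlyTable((ord(d), ord(d)) for d in string.digits)


def keep_digits(text):
    return text.translate(digits_only)


print(keep_digits("abc123\U0001F600def456"))  # → 123456
```

The table holds only ten entries, so the setup cost is small, although each lookup of a non-digit now goes through a Python-level method call, which may cost some of the raw speed of a plain dict.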
Method | Loops | Repeats | Best result per loop |
---|---|---|---|
filter using isdigit | 1000 | 15 | 0.83 usec |
generator using isdigit | 1000 | 15 | 1.6 usec |
using re.sub | 1000 | 15 | 1.94 usec |
generator testing membership in digits | 1000 | 15 | 1.23 usec |
generator testing membership in digits set | 1000 | 15 | 1.19 usec |
use translate | 1000 | 15 | 0.797 usec |
use re.compile | 1000 | 15 | 1.52 usec |
use translate but make translation table every time | 20 | 5 | 1.21e+04 usec |
That last row in the table is to show the setup penalty for translate. I used the default number and repeat options when creating the translation table every time, otherwise it takes too long.
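For anyone who would rather see the setup penalty from inside Python than through the shell, here is a rough sketch using the timeit module directly. The helper name `build_table` and the loop counts are my own choices for illustration, and the absolute numbers will differ from machine to machine:

```python
import string
import timeit

test_string = "abc123def45ghi6789"


def build_table():
    # Same table as in the answer: every BMP code point except the ASCII digits.
    return dict.fromkeys(x for x in range(65536) if chr(x) not in string.digits)


table = build_table()

# Average cost of building the table once.
build_seconds = timeit.timeit(build_table, number=5) / 5

# Average cost of one translate call with a prebuilt table.
translate_seconds = timeit.timeit(lambda: test_string.translate(table), number=1000) / 1000

print(f"build table: {build_seconds * 1e6:.0f} usec")
print(f"translate:   {translate_seconds * 1e6:.3f} usec")
```

The gap between the two figures is the one-off cost you pay before translate starts to win.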
The raw output from my timing script:
/bin/zsh /Users/henry.longmore/Library/Application\ Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh
+/Users/henry.longmore/Library/Application Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh:6> which python
/Users/henry.longmore/.pyenv/shims/python
+/Users/henry.longmore/Library/Application Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh:7> python --version
Python 3.10.6
+/Users/henry.longmore/Library/Application Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh:8> set +x
-----filter using isdigit
1000 loops, best of 15: 0.83 usec per loop
-----generator using isdigit
1000 loops, best of 15: 1.6 usec per loop
-----using re.sub
1000 loops, best of 15: 1.94 usec per loop
-----generator testing membership in digits
1000 loops, best of 15: 1.23 usec per loop
-----generator testing membership in digits set
1000 loops, best of 15: 1.19 usec per loop
-----use translate
1000 loops, best of 15: 0.797 usec per loop
-----use re.compile
1000 loops, best of 15: 1.52 usec per loop
-----use translate but make translation table every time
using default number and repeat, otherwise this takes too long
20 loops, best of 5: 1.21e+04 usec per loop
The script I used for the timings:
NUMBER=1000
REPEAT=15
UNIT="usec"
TEST_STRING="abc123def45ghi6789"
set -x
which python
python --version
set +x
echo "-----filter using isdigit"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT "''.join(filter(str.isdigit, '${TEST_STRING}'))"
echo "-----generator using isdigit"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT "''.join(c for c in '${TEST_STRING}' if c.isdigit())"
echo "-----using re.sub"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="import re" "re.sub('[^0-9]', '', '${TEST_STRING}')"
echo "-----generator testing membership in digits"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="from string import digits" "''.join(c for c in '${TEST_STRING}' if c in digits)"
echo "-----generator testing membership in digits set"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="from string import digits; digits = {*digits}" "''.join(c for c in '${TEST_STRING}' if c in digits)"
echo "-----use translate"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="import string; unicode_non_digits = dict.fromkeys([x for x in range(65536) if chr(x) not in string.digits])" "'${TEST_STRING}'.translate(unicode_non_digits)"
echo "-----use re.compile"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="import re; digit_filter = re.compile('[^0-9]')" "digit_filter.sub('', '${TEST_STRING}')"
echo "-----use translate but make translation table every time"
echo " using default number and repeat, otherwise this takes too long"
python -m timeit --unit=$UNIT --setup="import string" "unicode_non_digits = dict.fromkeys([x for x in range(65536) if chr(x) not in string.digits]); '${TEST_STRING}'.translate(unicode_non_digits)"
Upvotes: 4
Reputation: 589
An easy way:
str.isdigit() returns True if str contains only numeric characters. Call filter(predicate, iterable) with str.isdigit as predicate and the string as iterable to return an iterable containing only the string's numeric characters. Call str.join(iterable) with the empty string as str and the result of filter() as iterable to join each numeric character together into one string.
For example:
a_string = "!1a2;b3c?"
numeric_filter = filter(str.isdigit, a_string)
numeric_string = "".join(numeric_filter)
print(numeric_string)
And the output is:
123
Upvotes: 4
Reputation: 96061
This should work for both strings and unicode objects in Python2, and both strings and bytes in Python3:
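# python <3.0
def only_numerics(seq):
    return filter(type(seq).isdigit, seq)
# python ≥3.0
def only_numerics(seq):
    seq_type = type(seq)
    # Slice instead of iterating: iterating bytes yields ints, which
    # bytes.isdigit rejects; length-1 slices keep the original type.
    chars = (seq[i:i + 1] for i in range(len(seq)))
    return seq_type().join(filter(seq_type.isdigit, chars))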
Upvotes: 27
Reputation: 91
Many right answers, but in case you want the result directly as a float, without using regex:
x = '$123.45M'
float(''.join(c for c in x if (c.isdigit() or c == '.')))
123.45
You can swap the point for a comma, depending on your needs.
If you know your number is an integer, use this instead:
x = '$1123'
int(''.join(c for c in x if c.isdigit()))
1123
Upvotes: 9
Reputation: 3831
@Ned Batchelder and @newacct provided the right answer, but ...
Just in case your string contains commas (,) and a decimal point (.):
import re
re.sub(r"[^\d.]", "", "$1,999,888.77")
'1999888.77'
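If the goal is an actual number rather than a cleaned string, the substitution can be wrapped in a small helper. This is only a sketch: `parse_amount` is a hypothetical name, and it assumes the comma is a thousands separator and the point is the decimal mark:

```python
import re


def parse_amount(text):
    # Drop everything except digits and the decimal point, then convert.
    cleaned = re.sub(r"[^\d.]", "", text)
    return float(cleaned)


print(parse_amount("$1,999,888.77"))  # → 1999888.77
```

Input with more than one point (for example "1.2.3") would still reach float() and raise ValueError, so validate first if that can happen in your data.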
Upvotes: 27
Reputation: 18385
Just to add another option to the mix, there are several useful constants within the string module. While more useful in other cases, they can be used here.
>>> from string import digits
>>> ''.join(c for c in "abc123def456" if c in digits)
'123456'
There are several constants in the module, including:
ascii_letters (abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ)
hexdigits (0123456789abcdefABCDEF)
If you are using these constants heavily, it can be worthwhile to convert them to a frozenset. That enables O(1) lookups, rather than O(n), where n is the length of the constant for the original strings.
>>> digits = frozenset(digits)
>>> ''.join(c for c in "abc123def456" if c in digits)
'123456'
Upvotes: 11
Reputation: 882691
Fastest approach, if you need to perform more than just one or two such removal operations (or even just one, but on a very long string!-), is to rely on the translate method of strings, even though it does need some prep:
>>> import string
>>> allchars = ''.join(chr(i) for i in xrange(256))
>>> identity = string.maketrans('', '')
>>> nondigits = allchars.translate(identity, string.digits)
>>> s = 'abc123def456'
>>> s.translate(identity, nondigits)
'123456'
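The session above uses Python 2 names (xrange, string.maketrans, and the two-argument str.translate). On Python 3 the closest equivalent for byte strings is probably bytes.translate with a delete argument. A minimal sketch, where `keep_digit_bytes` is a name made up for illustration:

```python
import string

# All 256 byte values except the ASCII digits: these are the bytes to delete.
digit_bytes = string.digits.encode("ascii")
nondigits = bytes(b for b in range(256) if b not in digit_bytes)


def keep_digit_bytes(data):
    # Passing None as the table leaves the surviving bytes unchanged.
    return data.translate(None, nondigits)


print(keep_digit_bytes(b"abc123def456"))  # → b'123456'
```

As in the original, the prep (building nondigits) is done once and reused for every call.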
The translate method is different, and maybe a tad simpler to use, on Unicode strings than it is on byte strings, btw:
>>> unondig = dict.fromkeys(xrange(65536))
>>> for x in string.digits: del unondig[ord(x)]
...
>>> s = u'abc123def456'
>>> s.translate(unondig)
u'123456'
You might want to use a mapping class rather than an actual dict, especially if your Unicode string may potentially contain characters with very high ord values (that would make the dict excessively large;-). For example:
>>> class keeponly(object):
...     def __init__(self, keep):
...         self.keep = set(ord(c) for c in keep)
...     def __getitem__(self, key):
...         if key in self.keep:
...             return key
...         return None
...
>>> s.translate(keeponly(string.digits))
u'123456'
>>>
Upvotes: 5