Reputation: 2619

Removing all non-numeric characters from string in Python

How do we remove all non-numeric characters from a string in Python?

Upvotes: 256

Answers (10)

theRPGmaster

Reputation: 131

In addition to Mark's answer, if you need to convert multiple strings, you can create a lambda:

make_num = lambda s: ''.join([c for c in s if c.isdigit()])
string_a = make_num(string_a)
string_b = make_num(string_b)

Upvotes: 0

Ned Batchelder

Reputation: 376012

>>> import re
>>> re.sub("[^0-9]", "", "sdkjh987978asd098as0980a98sd")
'987978098098098'
>>> # or
>>> re.sub(r"\D", "", "sdkjh987978asd098as0980a98sd")
'987978098098098'
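One detail worth knowing, assuming Python 3 and str (not bytes) patterns: \D is Unicode-aware there, so the two patterns above are not quite equivalent once the text contains non-ASCII digits. The sample string below is made up for illustration; this is only a sketch of the difference, not a recommendation for one pattern over the other.

```python
import re

# Hypothetical sample: "٣" is ARABIC-INDIC DIGIT THREE, a Unicode decimal digit.
sample = "a٣b4"

# An explicit character class keeps only the ASCII digits 0-9.
ascii_class = re.sub("[^0-9]", "", sample)

# With a str pattern, \d matches any Unicode decimal digit, so \D leaves "٣" in place.
unicode_aware = re.sub(r"\D", "", sample)

# The re.ASCII flag restricts \d (and therefore \D) to the ASCII range.
ascii_flag = re.sub(r"\D", "", sample, flags=re.ASCII)

print(ascii_class, unicode_aware, ascii_flag)
```

If the result is going to be passed to int() this rarely matters, but it can matter when the output must be strictly 0-9.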

Upvotes: 424

Mark Rushakoff

Reputation: 258478

Not sure if this is the most efficient way, but:

>>> ''.join(c for c in "abc123def456" if c.isdigit())
'123456'

The ''.join part means to combine all the resulting characters together without any characters in between. Then the rest of it is a generator expression, where (as you can probably guess) we only take the parts of the string that match the condition isdigit.
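A caveat on the isdigit condition: for str it is broader than the ASCII characters 0-9, since it also accepts characters such as superscript digits and digits from other scripts. Here is a minimal sketch of a stricter variant, assuming Python 3.7+ for str.isascii; the function name ascii_digits_only and the sample strings are hypothetical.

```python
def ascii_digits_only(text: str) -> str:
    """Keep only the ASCII digits 0-9 from text."""
    # c.isascii() filters out characters like "²" that isdigit() alone accepts.
    return "".join(c for c in text if c.isascii() and c.isdigit())


# Plain isdigit keeps the superscript two; the stricter version drops it.
loose = "".join(c for c in "x²4" if c.isdigit())
strict = ascii_digits_only("x²4")
print(repr(loose), repr(strict))
```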

Upvotes: 151

hlongmore

Reputation: 1856

There are a lot of correct answers here. Some are faster or slower than others. The approach used in Ehsan Akbaritabar's and tzot's answers, filter with str.isdigit, is really fast; as is translate, from Alex Martelli's answer, once the setup is done. These are the two fastest methods. However, if you are only doing the substitution once, the setup penalty for translate is significant.

Which way is the best may depend on your use case. One replacement in a unit test? I'd go for filter using isdigit. It requires no imports, uses builtins only, and is quick and easy:

''.join(filter(str.isdigit, string_to_filter))

In a pandas or pyspark DataFrame, with millions of rows, the efficiency of translate is probably worth it, if you don't use the methods the DataFrame provides (which tend to rely on regex).

If you want to use the translate approach, I'd recommend some changes for Python 3:

import string

unicode_non_digits = dict.fromkeys(
    [x for x in range(65536) if chr(x) not in string.digits]
)
string_to_filter.translate(unicode_non_digits)

Method	Loops	Repeats	Best time per loop
`filter using isdigit`	1000	15	0.83 usec
`generator using isdigit`	1000	15	1.6 usec
`using re.sub`	1000	15	1.94 usec
`generator testing membership in digits`	1000	15	1.23 usec
`generator testing membership in digits set`	1000	15	1.19 usec
`use translate`	1000	15	0.797 usec
`use re.compile`	1000	15	1.52 usec
`use translate but make translation table every time`	20	5	1.21e+04 usec

That last row in the table is to show the setup penalty for translate. I used the default number and repeat options when creating the translation table every time, otherwise it takes too long.

The raw output from my timing script:

/bin/zsh /Users/henry.longmore/Library/Application\ Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh
+/Users/henry.longmore/Library/Application Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh:6> which python
/Users/henry.longmore/.pyenv/shims/python
+/Users/henry.longmore/Library/Application Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh:7> python --version
Python 3.10.6
+/Users/henry.longmore/Library/Application Support/JetBrains/PyCharm2022.2/scratches/scratch_4.sh:8> set +x
-----filter using isdigit
1000 loops, best of 15: 0.83 usec per loop
-----generator using isdigit
1000 loops, best of 15: 1.6 usec per loop
-----using re.sub
1000 loops, best of 15: 1.94 usec per loop
-----generator testing membership in digits
1000 loops, best of 15: 1.23 usec per loop
-----generator testing membership in digits set
1000 loops, best of 15: 1.19 usec per loop
-----use translate
1000 loops, best of 15: 0.797 usec per loop
-----use re.compile
1000 loops, best of 15: 1.52 usec per loop
-----use translate but make translation table every time
     using default number and repeat, otherwise this takes too long
20 loops, best of 5: 1.21e+04 usec per loop

The script I used for the timings:

NUMBER=1000
REPEAT=15
UNIT="usec"
TEST_STRING="abc123def45ghi6789"
set -x
which python
python --version
set +x
echo "-----filter using isdigit"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT "''.join(filter(str.isdigit, '${TEST_STRING}'))"
echo "-----generator using isdigit"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT "''.join(c for c in '${TEST_STRING}' if c.isdigit())"
echo "-----using re.sub"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="import re" "re.sub('[^0-9]', '', '${TEST_STRING}')"
echo "-----generator testing membership in digits"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="from string import digits" "''.join(c for c in '${TEST_STRING}' if c in digits)"
echo "-----generator testing membership in digits set"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="from string import digits; digits = {*digits}" "''.join(c for c in '${TEST_STRING}' if c in digits)"
echo "-----use translate"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="import string; unicode_non_digits = dict.fromkeys([x for x in range(65536) if chr(x) not in string.digits])" "'${TEST_STRING}'.translate(unicode_non_digits)"
echo "-----use re.compile"
python -m timeit --number=$NUMBER --repeat=$REPEAT --unit=$UNIT --setup="import re; digit_filter = re.compile('[^0-9]')" "digit_filter.sub('', '${TEST_STRING}')"
echo "-----use translate but make translation table every time"
echo "     using default number and repeat, otherwise this takes too long"
python -m timeit --unit=$UNIT --setup="import string" "unicode_non_digits = dict.fromkeys([x for x in range(65536) if chr(x) not in string.digits]); '${TEST_STRING}'.translate(unicode_non_digits)"
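For anyone who would rather stay inside Python, a rough equivalent of part of the script can be sketched with the timeit module directly. This covers only three of the methods, uses a smaller repeat count than the script, and passes the test string in as a variable rather than inlining it as a literal, so the numbers will not match the table exactly. The helper name best_usec_per_loop is hypothetical.

```python
import timeit

TEST_STRING = "abc123def45ghi6789"  # same sample string as the shell script
NUMBER = 1000
REPEAT = 5  # smaller than the script's 15, just to keep the sketch quick

# name -> (statement, setup); a subset of the methods timed above.
CASES = {
    "filter using isdigit": ("''.join(filter(str.isdigit, s))", "pass"),
    "generator using isdigit": ("''.join(c for c in s if c.isdigit())", "pass"),
    "using re.sub": ("re.sub('[^0-9]', '', s)", "import re"),
}


def best_usec_per_loop(stmt, setup):
    """Return the best per-loop time in microseconds over REPEAT runs."""
    times = timeit.repeat(
        stmt, setup=setup, repeat=REPEAT, number=NUMBER,
        globals={"s": TEST_STRING},
    )
    return min(times) / NUMBER * 1e6


results = {name: best_usec_per_loop(stmt, setup)
           for name, (stmt, setup) in CASES.items()}
for name, usec in results.items():
    print(f"{name}: {usec:.3g} usec per loop")
```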

Upvotes: 4

Ehsan Akbaritabar

Reputation: 589

An easy way:

str.isdigit() returns True if str contains only numeric characters. Call filter(predicate, iterable) with str.isdigit as predicate and the string as iterable to return an iterable containing only the string's numeric characters. Call str.join(iterable) with the empty string as str and the result of filter() as iterable to join each numeric character together into one string.

For example:

a_string = "!1a2;b3c?"
numeric_filter = filter(str.isdigit, a_string)
numeric_string = "".join(numeric_filter)
print(numeric_string)

And the output is:

123

Upvotes: 4

tzot

Reputation: 96061

This should work for both strings and unicode objects in Python2, and both strings and bytes in Python3:

# python <3.0
def only_numerics(seq):
    return filter(type(seq).isdigit, seq)

# python ≥3.0
def only_numerics(seq):
    seq_type = type(seq)
    return seq_type().join(filter(seq_type.isdigit, seq))
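
One caveat with the Python ≥3.0 version when it is given bytes: iterating over a bytes object yields integers, so bytes.isdigit cannot be applied to each element and the call raises TypeError. A minimal sketch of a variant that handles both types follows; the name only_numerics_py3 is hypothetical, and for bytes it keeps only the ASCII digits.

```python
def only_numerics_py3(seq):
    """Keep digit characters from a str, or ASCII digit bytes from bytes/bytearray."""
    if isinstance(seq, (bytes, bytearray)):
        # Iterating bytes yields ints; 48..57 are the ASCII codes for b"0".."9".
        return type(seq)(b for b in seq if 48 <= b <= 57)
    return "".join(filter(str.isdigit, seq))


print(only_numerics_py3("abc123def456"))
print(only_numerics_py3(b"abc123def456"))
```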

Upvotes: 27

Alberto Ibarra

Reputation: 91

Many right answers but in case you want it in a float, directly, without using regex:

x = '$123.45M'

float(''.join(c for c in x if (c.isdigit() or c == '.')))

123.45

You can change the point for a comma depending on your needs.

Use this instead if you know your number is an integer:

x = '$1123'
int(''.join(c for c in x if c.isdigit()))

1123

Upvotes: 9

kennyut

Reputation: 3831

@Ned Batchelder and @newacct provided the right answer, but ...

Just in case you have commas (,) and a decimal point (.) in your string:

import re
re.sub(r"[^\d.]", "", "$1,999,888.77")
'1999888.77'

Upvotes: 27

Tim McNamara

Reputation: 18385

Just to add another option to the mix, there are several useful constants within the string module. While more useful in other cases, they can be used here.

>>> from string import digits
>>> ''.join(c for c in "abc123def456" if c in digits)
'123456'

There are several constants in the module, including:

ascii_letters (abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ)
hexdigits (0123456789abcdefABCDEF)

If you are using these constants heavily, it can be worthwhile to convert them to a frozenset. That enables O(1) lookups, rather than O(n), where n is the length of the original string constant.

>>> digits = frozenset(digits)
>>> ''.join(c for c in "abc123def456" if c in digits)
'123456'

Upvotes: 11

Alex Martelli

Reputation: 882691

Fastest approach, if you need to perform more than just one or two such removal operations (or even just one, but on a very long string!-), is to rely on the translate method of strings, even though it does need some prep:

>>> import string
>>> allchars = ''.join(chr(i) for i in xrange(256))
>>> identity = string.maketrans('', '')
>>> nondigits = allchars.translate(identity, string.digits)
>>> s = 'abc123def456'
>>> s.translate(identity, nondigits)
'123456'

The translate method is different, and maybe a tad simpler to use, on Unicode strings than it is on byte strings, btw:

>>> unondig = dict.fromkeys(xrange(65536))
>>> for x in string.digits: del unondig[ord(x)]
... 
>>> s = u'abc123def456'
>>> s.translate(unondig)
u'123456'

You might want to use a mapping class rather than an actual dict, especially if your Unicode string may potentially contain characters with very high ord values (that would make the dict excessively large;-). For example:

>>> class keeponly(object):
...   def __init__(self, keep): 
...     self.keep = set(ord(c) for c in keep)
...   def __getitem__(self, key):
...     if key in self.keep:
...       return key
...     return None
... 
>>> s.translate(keeponly(string.digits))
u'123456'
>>>
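
The sessions above are Python 2 (xrange, string.maketrans, u'' literals). A rough Python 3 sketch of the same two ideas might look like the following; the class name KeepOnly and the variable names are hypothetical, and the bytes half assumes ASCII digits only.

```python
import string


class KeepOnly:
    """Mapping for str.translate that keeps only the given characters."""

    def __init__(self, keep):
        self.keep = {ord(c) for c in keep}

    def __getitem__(self, key):
        # str.translate passes the code point; returning None deletes the character.
        return key if key in self.keep else None


digits_only = KeepOnly(string.digits)
text_result = "abc123def456".translate(digits_only)

# For bytes, translate takes an optional table plus a set of bytes to delete.
digit_bytes = string.digits.encode("ascii")
non_digit_bytes = bytes(b for b in range(256) if b not in digit_bytes)
bytes_result = b"abc123def456".translate(None, non_digit_bytes)

print(text_result, bytes_result)
```

Because the mapping is computed on demand, it also covers characters above U+FFFF without building a large dict up front.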

Upvotes: 5
