Obcure

Reputation: 1001

Removing non-numeric characters from a string in Python

I have been given the task of removing all non-numeric characters, including spaces, from either a text file or a string and then printing the result. For example:

Before:

sd67637 8

After:

676378

As I am a beginner I do not know where to start with this task.

Upvotes: 69

Views: 117295

Answers (9)

MarcusAurelius

Reputation: 73

This function converts numeric strings, with or without unit abbreviations, into numbers. If the source string uses a decimal comma, pass dec=','. The default result is a float; set toInt=True to get a rounded integer instead.

Unit abbreviations are recognized automatically from the md dictionary, which you can edit: each key is a unit abbreviation and each value is its multiplier. The result is always a number you can calculate with. This all-in-one function is not the fastest method, but it handles every format shown in the examples below.

import re
'''
units: gr=grams, K=thousands, M=millions, B=billions, ms=milliseconds, mt=metric tonnes
'''
md = {'gr': 0.001, '%': 0.01, 'K': 1000, 'M': 1000000, 'B': 1000000000, 'ms': 0.001, 'mt': 1000}
seps = {'.': True, ',': False}
kl = list(md.keys())

def to_Float_or_Int(strVal, toInt=None, dec=None):
    toInt = False if toInt is None else toInt
    dec = '.' if dec is None else dec

    def chck_char_in_string(strVal):
        rs = None
        for el in kl:
            if el in strVal:
                rs = el
                break
        return rs

    if dec in seps.keys():
        dcp = seps[dec]
        strVal = strVal.strip()
        mpk = chck_char_in_string(strVal)
        mp = 1 if mpk is None else md[mpk]
        strVal = re.sub(r'[^\de.,-]+', '', strVal)
        if dcp:
            strVal = strVal.replace(',', '')
        else:
            strVal = strVal.replace('.', '')
            strVal = strVal.replace(',', '.')
        dcnm = float(strVal)
        dcnm = dcnm * mp
        dcnm = int(round(dcnm)) if toInt else dcnm
    else:
        print('wrong decimal separator')
        dcnm = None
    return dcnm

Call the function as follows:

pvals = ['-123,456', '-45,145.01 K', '753,159.456', '1,000,000', '985 ms' , '888 745.23', '1.753 e-04']
cvals = ['-123,456', '1,354852M', '+10.000,12 gr', '-87,24%', '10,2K', '985 ms', '(mt) 0,475', ' ,159']
print('decimal point strings')
for val in pvals:
    result = to_Float_or_Int(val)
    print(result)
print()
print('decimal comma strings')
for val in cvals:
    result = to_Float_or_Int(val, dec=',')
    print(result)
exit()

The output results:

decimal point strings
-123456.0
-45145010.0
753159.456
1000000.0
0.985
888745.23
0.0001753

decimal comma strings
-123.456
1354852.0
10.00012
-0.8724
10200.0
0.985
475.0
0.159

Upvotes: 0

Sev

Reputation: 87

import re
result = re.sub(r'\D', '', 'sd67637 8')
print(result)  # 676378
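
One caveat, shown here as a small sketch rather than a definitive rule for your data: in Python 3, \D applied to a str is Unicode-aware, so decimal digits from other scripts are kept. Passing the standard re.ASCII flag restricts the match to 0-9. The sample string is my own and ends with an Arabic-Indic digit.

```python
import re

# Hypothetical sample: ASCII digits plus ARABIC-INDIC DIGIT THREE (U+0663)
text = 'sd67637 8 \u0663'

# Default (Unicode) matching: the Arabic-Indic digit counts as a digit and is kept
unicode_digits = re.sub(r'\D', '', text)

# ASCII matching: only the characters 0-9 are kept
ascii_digits = re.sub(r'\D', '', text, flags=re.ASCII)

print(repr(unicode_digits))
print(repr(ascii_digits))
```

If the input is known to be plain ASCII, both calls give the same result.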

Upvotes: 1

Daniel Morell

Reputation: 2596

I would not use RegEx for this. It is a lot slower!

Instead let's just use a simple for loop.

TL;DR

This function will get the job done fast...

def filter_non_digits(string: str) -> str:
    result = ''
    for char in string:
        if char in '1234567890':
            result += char
    return result 

The Explanation

Let's create a very basic benchmark to test a few different methods that have been proposed. I will test three methods...

  1. For loop method (my idea).
  2. List Comprehension method from Jon Clements' answer.
  3. RegEx method from Moradnejad's answer.

# filters.py

import re

# For loop method
def filter_non_digits_for(string: str) -> str:
    result = ''
    for char in string:
        if char in '1234567890':
            result += char
    return result 


# Comprehension method
def filter_non_digits_comp(s: str) -> str:
    return ''.join(ch for ch in s if ch.isdigit())


# RegEx method
def filter_non_digits_re(string: str) -> str:
    return re.sub(r'[^\d]', '', string)

Now that we have an implementation of each way of removing non-digits, let's benchmark each one.

Here is some very basic and rudimentary benchmark code. However, it will do the trick and give us a good comparison of how each method performs.

# tests.py

import time, platform
from filters import (
    filter_non_digits_re,
    filter_non_digits_comp,
    filter_non_digits_for,
)


def benchmark_func(func):
    start = time.time()
    # the "_" in the number just makes it more readable
    for i in range(100_000):
        func('afes098u98sfe')
    end = time.time()
    return (end-start)/100_000


def bench_all():
    print(f'# System ({platform.system()} {platform.machine()})')
    print(f'# Python {platform.python_version()}\n')

    tests = [
        filter_non_digits_re,
        filter_non_digits_comp,
        filter_non_digits_for,
    ]

    for t in tests:
        duration = benchmark_func(t)
        ns = round(duration * 1_000_000_000)
        print(f'{t.__name__.ljust(30)} {str(ns).rjust(6)} ns/op')


if __name__ == "__main__":
    bench_all()

Here is the output from the benchmark code.

# System (Windows AMD64)
# Python 3.9.8

filter_non_digits_re             2920 ns/op
filter_non_digits_comp           1280 ns/op
filter_non_digits_for             660 ns/op

As you can see, the filter_non_digits_for() function is more than four times faster than using RegEx, and about twice as fast as the comprehension method. Sometimes simple is best.
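
If you want to repeat the comparison yourself, here is a minimal single-file sketch that uses the standard timeit module instead of a hand-rolled timer. The function names, the RUNS count, and the pre-compiled pattern are my own choices for illustration, not part of the benchmark above, and the numbers will differ by machine and Python version.

```python
import re
import timeit

SAMPLE = 'afes098u98sfe'   # same sample string as the benchmark above
RUNS = 20_000              # arbitrary repeat count chosen for this sketch
NON_DIGITS = re.compile(r'[^0-9]')  # compiled once, outside the timed call


def with_loop(s: str) -> str:
    result = ''
    for ch in s:
        if ch in '0123456789':
            result += ch
    return result


def with_comprehension(s: str) -> str:
    return ''.join(ch for ch in s if ch in '0123456789')


def with_compiled_regex(s: str) -> str:
    return NON_DIGITS.sub('', s)


for func in (with_loop, with_comprehension, with_compiled_regex):
    # timeit returns the total time in seconds for RUNS calls
    seconds = timeit.timeit(lambda: func(SAMPLE), number=RUNS)
    print(f'{func.__name__:<22} {seconds / RUNS * 1e9:8.0f} ns/op')
```

Results on a 13-character string may not carry over to long inputs, so it is worth timing with data that resembles your own.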

Upvotes: 4

Kiprono Elijah Koech

Reputation: 338

Adding to @Moradnejad's answer: you can use the following code to extract integers, floats, and even signed values.

import re
a = re.findall(r"[-+]?\d*\.\d+|\d+", "Over th44e same pe14.1riod of time, p-0.8rices also rose by 82.8p")

You can then convert the list items to floats using map.

print(list(map(float, a)))

[44.0, 14.1, -0.8, 82.8]
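
One thing to watch with this pattern: the optional sign belongs only to the first alternative, so a signed integer such as -5 comes back as 5. A possible variant, offered as a sketch (the sample sentence is made up), puts both alternatives in a group behind the sign:

```python
import re

# Hypothetical sample text with a signed integer, a signed float and a bare integer
text = 'rose by -5 then +0.25 and 7'

# Pattern from the answer: the sign is only attached to numbers with a decimal point
original = re.findall(r'[-+]?\d*\.\d+|\d+', text)

# Variant: the optional sign applies to both alternatives
signed = re.findall(r'[-+]?(?:\d*\.\d+|\d+)', text)

print(original)
print(list(map(float, signed)))
```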

Upvotes: 0

Moradnejad

Reputation: 3673

To extract Integers

Example: sd67637 8 ==> 676378

import re
def extract_int(x):
    return re.sub(r'[^\d]', '', x)

To extract a single float/int number (possible decimal separator)

Example: sd7512.sd23 ==> 7512.23

import re
def extract_single_float(x):
    return re.sub(r'[^\d.]', '', x)

To extract multiple float/int numbers

Example: 123.2 xs12.28 4 ==> ['123.2', '12.28', '4']

import re
def extract_floats(x):
    return re.findall(r'\d+(?:\.\d+)?', x)

Upvotes: 1

Saullo G. P. Castro

Reputation: 59005

You can use string.ascii_letters to identify your non-digits:

from string import ascii_letters

a = 'sd67637 8'
a = a.replace(' ', '')

for i in ascii_letters:
    a = a.replace(i, '')

If you also want to remove an apostrophe ('), write it inside double quotes: "'".
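
Note that removing letters and spaces leaves punctuation in place. A minimal sketch of the opposite approach, keeping only the characters found in string.digits (the sample string with extra punctuation is my own):

```python
from string import digits

a = 'sd67637 8!?'  # hypothetical input with punctuation added

# Keep a character only if it is one of '0123456789'
a = ''.join(ch for ch in a if ch in digits)

print(a)  # 676378
```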

Upvotes: 1

Inbar Rose

Reputation: 43497

There is a builtin for this in Python 2.

string.translate(s, table[, deletechars])

Delete all characters from s that are in deletechars (if present), and then translate the characters using table, which must be a 256-character string giving the translation for each character value, indexed by its ordinal. If table is None, then only the character deletion step is performed.

>>> import string
>>> non_numeric_chars = ''.join(set(string.printable) - set(string.digits))
>>> non_numeric_chars = string.printable[10:]  # more effective method. (choose one)
>>> 'sd67637 8'.translate(None, non_numeric_chars)
'676378'

Or you could do it with no imports (but there is no reason for this):

>>> chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
>>> 'sd67637 8'.translate(None, chars)
'676378'
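
The two-argument translate call above is Python 2 only. In Python 3, str.translate takes a single table, which you can build with str.maketrans. A minimal sketch of the same idea, assuming the input only contains printable ASCII (any other character would pass through unchanged):

```python
import string

# Every printable ASCII character that is not a digit
non_numeric_chars = ''.join(sorted(set(string.printable) - set(string.digits)))

# The third argument of str.maketrans lists characters to delete
delete_table = str.maketrans('', '', non_numeric_chars)

result = 'sd67637 8'.translate(delete_table)
print(result)  # 676378
```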

Upvotes: 11

mar mar

Reputation: 1238

The easiest way is with a regexp

import re
a = 'lkdfhisoe78347834 (())&/&745  '
result = re.sub('[^0-9]','', a)

print(result)
# 78347834745

Upvotes: 118

Jon Clements

Reputation: 142226

Loop over your string, char by char and only include digits:

new_string = ''.join(ch for ch in your_string if ch.isdigit())

Or use a regex on your string (if at some point you wanted to treat non-contiguous groups separately)...

import re
s = 'sd67637 8' 
new_string = ''.join(re.findall(r'\d+', s))
# 676378

Then just print them out:

print(s, '=', new_string)
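
Since the question also mentions a text file, here is a rough sketch of applying the same filter line by line. The digits_only helper and the temporary file are made up for the example so that it runs on its own; with a real file you would just open its path.

```python
import os
import tempfile


def digits_only(text):
    # Hypothetical helper wrapping the comprehension shown above
    return ''.join(ch for ch in text if ch.isdigit())


# Create a throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('sd67637 8\nab12 c3\n')
    path = tmp.name

# Filter each line; the trailing newline is dropped because it is not a digit
with open(path, encoding='utf-8') as fh:
    cleaned = [digits_only(line) for line in fh]

os.remove(path)
print(cleaned)  # ['676378', '123']
```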

Upvotes: 32
