Samual
Samual

Reputation: 552

Combining Many Regex Operations Together

I am working on an NLP project of text processing using python in which I need to do a data cleaning before feature extractions. I am doing the cleaning of special characters and number separations with chars using regex operation but I am doing all these in many operations separately which is making it slow. I want to make it in as few as possible operations or in a faster way.

my code is as follows

def remove_special_char(x):
    if type(x) is str:
        x = x.replace('-', ' ').replace('(', ',').replace(')', ',')
        x = re.compile(r"\s+").sub(" ", x).strip()
        x = re.sub(r'[^A-Z a-z 0-9-,.x]+', '', x).lower()
        x = re.sub(r"([0-9]+(\.[0-9]+)?)",r" \1 ", x).strip()
        x = x.replace(",,",",")
        return x
    else:
        return x 

Can anyone help me?

Upvotes: 1

Views: 38

Answers (2)

Alain T.
Alain T.

Reputation: 42143

In addition to preparing the compiled patterns outside the function, you can gain some performance by using translate for all the one-to-one or one-to-none conversions:

import string
mappings     = {'-':' ', '(':',', ')':','}            # add more mappings as needed
mappings.update({ c:' ' for c in string.whitespace }) # white spaces become spaces
mappings.update({c:c.lower() for c in string.ascii_uppercase}) # set to lowercase
specialChars = str.maketrans(mappings)

def remove_special_char(x):
    x = x.translate(specialChars)
    ...
    return x

Upvotes: 3

AKX
AKX

Reputation: 168913

You have different replacement strings for the various operations, so you can't really merge them.

You can pre-compile all of the regexps beforehand though, but I suspect it won't make much of a difference:

paren_re = re.compile(r"[()]")
whitespace_re = re.compile(r"\s+")
ident_re = re.compile(r"[^A-Za-z0-9-,.x]+")
number_re = re.compile(r"([0-9]+(\.[0-9]+)?)")


def remove_special_char(x):
    if isinstance(x, str):
        x = x.replace("-", " ")
        x = paren_re.sub(",", x)
        x = whitespace_re.sub(" ", x)
        x = ident_re.sub("", x).lower()
        x = number_re.sub(r" \1 ", x).strip()
        x = x.replace(",,", ",")
    return x

Have you profiled your program to see that this is the bottleneck?

Upvotes: 2

Related Questions