Reputation: 552
I am working on an NLP text-processing project in Python in which I need to clean the data before feature extraction. I remove special characters and separate numbers from adjacent characters using regex operations, but I do all of this in many separate steps, which makes it slow. I want to do it in as few operations as possible, or in a faster way.
My code is as follows:
import re

def remove_special_char(x):
    if type(x) is str:
        x = x.replace('-', ' ').replace('(', ',').replace(')', ',')
        x = re.compile(r"\s+").sub(" ", x).strip()
        x = re.sub(r'[^A-Z a-z 0-9-,.x]+', '', x).lower()
        x = re.sub(r"([0-9]+(\.[0-9]+)?)", r" \1 ", x).strip()
        x = x.replace(",,", ",")
        return x
    else:
        return x
Can anyone help me?
Upvotes: 1
Views: 38
Reputation: 42143
In addition to compiling the patterns outside the function, you can gain some performance by using str.translate for all the one-to-one or one-to-none conversions:
import string

mappings = {'-': ' ', '(': ',', ')': ','}  # add more mappings as needed
mappings.update({c: ' ' for c in string.whitespace})    # whitespace becomes spaces
mappings.update({c: c.lower() for c in string.ascii_uppercase})  # set to lowercase
specialChars = str.maketrans(mappings)

def remove_special_char(x):
    x = x.translate(specialChars)
    ...
    return x
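To see what the translation table does on its own (the sample string here is purely illustrative):

```python
import string

# Same mapping table as above: hyphens to spaces, parentheses to commas,
# all whitespace to a plain space, uppercase to lowercase.
mappings = {'-': ' ', '(': ',', ')': ','}
mappings.update({c: ' ' for c in string.whitespace})
mappings.update({c: c.lower() for c in string.ascii_uppercase})
specialChars = str.maketrans(mappings)

print("Hello-World (TEST)".translate(specialChars))
# hello world ,test,
```

Because `translate` walks the string once in C, it handles all of these character-level substitutions in a single pass instead of one `replace` call per character.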
Upvotes: 3
Reputation: 168913
You have different replacement strings for the various operations, so you can't really merge them.
You can pre-compile all of the regexps beforehand, though I suspect it won't make much of a difference:
import re

paren_re = re.compile(r"[()]")
whitespace_re = re.compile(r"\s+")
ident_re = re.compile(r"[^A-Za-z0-9 \-,.x]+")  # keep the space, as in your original class
number_re = re.compile(r"([0-9]+(\.[0-9]+)?)")

def remove_special_char(x):
    if isinstance(x, str):
        x = x.replace("-", " ")
        x = paren_re.sub(",", x)
        x = whitespace_re.sub(" ", x)
        x = ident_re.sub("", x).lower()
        x = number_re.sub(r" \1 ", x).strip()
        x = x.replace(",,", ",")
    return x
Have you profiled your program to see that this is the bottleneck?
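To check whether pre-compiling actually matters for your data, a minimal `timeit` comparison might look like this (the sample string is made up; substitute your own text):

```python
import re
import timeit

sample = "Some  text\twith   irregular\nwhitespace " * 100

def inline():
    # re.sub compiles the pattern on every call (cached after the first time)
    return re.sub(r"\s+", " ", sample)

ws_re = re.compile(r"\s+")

def precompiled():
    # the pattern object is built once, up front
    return ws_re.sub(" ", sample)

print("inline:     ", timeit.timeit(inline, number=1000))
print("precompiled:", timeit.timeit(precompiled, number=1000))
```

Both produce identical output; the timing difference is usually small because `re` caches compiled patterns internally, which is why profiling first is worthwhile.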
Upvotes: 2