I have a class that defines a function `define_stop_words` that returns a list of string tokens. I then apply another function, `remove_stopwords`, which takes raw UTF-8 text as input, to a pandas DataFrame `df` that contains text. The code looks something like this:
class ProcessText:
    def __init__(self, flag):
        self.flag = flag  # not important for this question

    def define_stop_words(self):
        names = ['john','sally','billy','sarah']
        stops = ['if','the','in','then','an','a']
        return stops + names

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.define_stop_words()]

import pandas as pd

df = pd.read_csv('data.csv')
parse = ProcessText(flag=True)
# Pass the method itself to apply(); calling it here would raise a TypeError
df['text'] = df['text'].apply(parse.remove_stopwords)
My question is: will `remove_stopwords` call `define_stop_words` and rebuild the list it returns every time, for every word in `text`, for every row in `df` (for every iteration, basically)?
If this is the case, I don't want it to run like this, as it would be very slow and inefficient. I want to define the variable returned by `define_stop_words` once, almost like a "global variable" within the `ProcessText` class, and then use that variable in `remove_stopwords` multiple times (for every word and row in `df`).
Is there a way to do this - should this be done? What is the best practice in this case?
Upvotes: 1
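The `define_stop_words` method is called once for every word in `text`, each time you call the `remove_stopwords` method, because the condition of a list comprehension is re-evaluated on every iteration.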
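To check how often the method actually runs, one option is simply to count the calls. The sketch below is a cut-down version of the class from the question; the `calls` counter and the sample `tokens` list are additions made purely for illustration, and `flag` is left out for brevity.

```python
class ProcessText:
    def __init__(self):
        self.calls = 0  # hypothetical counter, added only for this demonstration

    def define_stop_words(self):
        self.calls += 1
        names = ['john', 'sally', 'billy', 'sarah']
        stops = ['if', 'the', 'in', 'then', 'an', 'a']
        return stops + names

    def remove_stopwords(self, text):
        # The condition is re-evaluated for every word in the comprehension
        return [word for word in text if word not in self.define_stop_words()]


parse = ProcessText()
tokens = ['the', 'cat', 'in', 'a', 'hat']  # made-up input
kept = parse.remove_stopwords(tokens)
print(kept)         # ['cat', 'hat']
print(parse.calls)  # 5, one call per word
```

With `DataFrame.apply`, that per-word cost is then repeated for every row.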
One way to build the stop words exactly once per instance, but lazily rather than when initializing the instance (you might have many such attributes, all of which are expensive to compute, and you don't always need all of them), is to use something like this:
class ProcessText:
    def __init__(self, flag):
        self.flag = flag  # not important for this question
        self._stop_words = None  # filled in on first access

    @property
    def stop_words(self):
        if self._stop_words is None:
            self._stop_words = set(['john','sally','billy','sarah'])
            self._stop_words |= set(['if','the','in','then','an','a'])
        return self._stop_words

    def remove_stopwords(self, text):
        # Use the cached property, not a method call
        return [word for word in text if word not in self.stop_words]
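On Python 3.8 or later, the same lazy, once-per-instance behaviour can probably be had with less bookkeeping through `functools.cached_property`. This is a sketch of that alternative, not the code from the answer above; the sample tokens are made up.

```python
from functools import cached_property  # available from Python 3.8


class ProcessText:
    def __init__(self, flag):
        self.flag = flag

    @cached_property
    def stop_words(self):
        # Runs on first access only; the result is then stored on the instance
        names = {'john', 'sally', 'billy', 'sarah'}
        stops = {'if', 'the', 'in', 'then', 'an', 'a'}
        return names | stops

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.stop_words]


parse = ProcessText(flag=True)
print(parse.remove_stopwords(['if', 'sally', 'runs']))  # ['runs']
```

Every later access to `parse.stop_words` returns the same set object, so the comprehension only pays for a set lookup per word.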
Upvotes: 2
You can assign those lists to class variables, like this:
class ProcessText:
    names = ['john','sally','billy','sarah']
    stops = ['if','the','in','then','an','a']

    def __init__(self, flag):
        self.flag = flag  # not important for this question

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.names + self.stops]

import pandas as pd

df = pd.read_csv('data.csv')
parse = ProcessText(flag=True)
# Pass the method itself to apply(); calling it here would raise a TypeError
df['text'] = df['text'].apply(parse.remove_stopwords)
Those class variables are shared by all instances. Assigning them in the `__init__()` method instead would repeat the assignment each time a new instance is created.
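One caveat: `self.names + self.stops` inside the comprehension still builds a new combined list for every word. A possible variation, sketched below, combines the two lists once at class-definition time. The `stop_words` attribute name and the use of `frozenset` are choices made for this example, not part of the answer above.

```python
class ProcessText:
    names = ['john', 'sally', 'billy', 'sarah']
    stops = ['if', 'the', 'in', 'then', 'an', 'a']
    # Built once, when the class body is executed
    stop_words = frozenset(names + stops)

    def __init__(self, flag):
        self.flag = flag

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.stop_words]


a = ProcessText(flag=True)
b = ProcessText(flag=False)
print(a.stop_words is b.stop_words)                   # True
print(a.remove_stopwords(['then', 'billy', 'left']))  # ['left']
```

Both instances see the same `frozenset`, and nothing is rebuilt per word or per instance.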
Upvotes: 3
You could cache the word lists by setting them in `__init__`, so that they are only created once per instance. Then, instead of using a `define_stop_words()` method, you would expose this as a property.
class ProcessText:
    def __init__(self, flag):
        self.flag = flag  # not important for this question
        self._names = ['john','sally','billy','sarah']
        self._stops = ['if','the','in','then','an','a']

    @property
    def define_stop_words(self):
        return self._stops + self._names

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.define_stop_words]
Note that in Python there is no real concept of a private variable (which is what I think you want to use here: you don't want the user to be able to overwrite these lists once created?). This means that a user of your code could still update the `_names` and `_stops` attributes of the `ProcessText` object after the initialiser has run, which could lead to unexpected results.
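To illustrate what the underscore convention does and does not protect against, here is a small sketch. It assumes a read-only property over a `frozenset`; the attribute name `_stop_words` and the shortened word list are made up for this example.

```python
class ProcessText:
    def __init__(self, flag):
        self.flag = flag
        # frozenset has no add()/remove(), so it cannot be changed in place
        self._stop_words = frozenset({'john', 'sally', 'if', 'the'})

    @property
    def stop_words(self):
        return self._stop_words


parse = ProcessText(flag=True)

try:
    parse.stop_words = {'oops'}  # no setter defined, so this fails
except AttributeError as err:
    print('read-only:', type(err).__name__)  # read-only: AttributeError

# The underscore attribute is still reachable; it is private by convention only
parse._stop_words = frozenset({'oops'})
print(parse.stop_words)  # frozenset({'oops'})
```

So a property without a setter guards against accidental reassignment, but not against someone who deliberately reaches for the underscore attribute.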
Another thing to consider is to use a set instead of a list (especially if performance is an issue), since a membership test on a set is a hash lookup (O(1) on average) rather than a linear scan of the list.
Of course, it would be faster again to combine the lists once and cache the combined set, instead of performing the concatenation on each call to the property (so that the property call simply returns a cached set), if you were nitpicking further!
For example:
class ProcessText:
    def __init__(self, flag):
        self.flag = flag  # not important for this question
        _names = {'john','sally','billy','sarah'}
        _stops = {'if','the','in','then','an','a'}
        self._stops_and_names = _names.union(_stops)

    @property
    def define_stop_words(self):
        return self._stops_and_names

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.define_stop_words]
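If you want to measure the difference rather than take it on trust, a rough comparison with the standard-library `timeit` module might look like the sketch below. The token list is made up, and the timings will vary by machine, so no numbers are quoted here.

```python
import timeit

names = ['john', 'sally', 'billy', 'sarah']
stops = ['if', 'the', 'in', 'then', 'an', 'a']

stop_list = stops + names   # combined once, as a list
stop_set = set(stop_list)   # combined once, as a set

# Made-up token sequence, purely for illustration
tokens = ['the', 'quick', 'fox', 'sally', 'jumps', 'in', 'a', 'box'] * 1000


def with_list():
    return [w for w in tokens if w not in stop_list]


def with_set():
    return [w for w in tokens if w not in stop_set]


def rebuilding_each_time():
    # Mirrors concatenating the lists inside the comprehension
    return [w for w in tokens if w not in stops + names]


for fn in (with_list, with_set, rebuilding_each_time):
    seconds = timeit.timeit(fn, number=20)
    print(f'{fn.__name__}: {seconds:.4f} s')
```

All three functions return the same tokens; only the cost of the membership test differs.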
Upvotes: 2