PyRsquared

Reputation: 7338

How to define variables that will be used in every iteration of a function within a class in python?

I have a class that defines a function define_stop_words that returns a list of string tokens. I then want to apply another function called remove_stopwords, which takes raw UTF-8 text as input, to a pandas dataframe df which contains text. The code looks something like this:

class ProcessText:

    def __init__(self, flag):
        self.flag = flag # not important for this question

    def define_stop_words(self):
        names = ['john','sally','billy','sarah']
        stops = ['if','the','in','then','an','a']
        return stops+names

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.define_stop_words()]


import pandas as pd
df = pd.read_csv('data.csv')
parse = ProcessText(flag=True)
# Pass the bound method itself; apply() calls it once per row
df['text'] = df['text'].apply(parse.remove_stopwords)

My question is: will remove_stopwords call define_stop_words and rebuild the list it returns every time, i.e. for every word in text, for every row in df (basically on every iteration)?

If this is the case, I don't want it to run like this, as it would be very slow and inefficient. I want to define the variable returned by define_stop_words once, almost like a "global variable" within the ProcessText class, and then use that variable in remove_stopwords multiple times (for every word and row in df).

Is there a way to do this - should this be done? What is the best practice in this case?

Upvotes: 1

Views: 123

Answers (3)

Graipher

Reputation: 7186

The define_stop_words method will be called once for every word in text, each time you call the remove_stopwords method, because the condition of a list comprehension is re-evaluated for every element.
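
How often define_stop_words actually runs can be checked by counting the calls. The sketch below adds a hypothetical call_count attribute (not part of the original class) and passes remove_stopwords a list of tokens:

```python
class ProcessText:

    def __init__(self, flag):
        self.flag = flag
        self.call_count = 0  # hypothetical counter, added only for this demonstration

    def define_stop_words(self):
        self.call_count += 1
        names = ['john', 'sally', 'billy', 'sarah']
        stops = ['if', 'the', 'in', 'then', 'an', 'a']
        return stops + names

    def remove_stopwords(self, text):
        # The "if" condition runs for every word, so the method is called per word
        return [word for word in text if word not in self.define_stop_words()]


parse = ProcessText(flag=True)
tokens = ['the', 'cat', 'sat', 'in', 'a', 'hat']
result = parse.remove_stopwords(tokens)
print(result)            # ['cat', 'sat', 'hat']
print(parse.call_count)  # 6, one call per token
```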

One way to build the stop words exactly once per instance, but not when initializing the instance (because you might have many such attributes, all of which are expensive, and you don't always need all of them), is to use a lazily evaluated property like this:

class ProcessText:

    def __init__(self, flag):
        self.flag = flag # not important for this question
        self._stop_words = None

    @property
    def stop_words(self):
        if self._stop_words is None:
            self._stop_words = set(['john','sally','billy','sarah'])
            self._stop_words |= set(['if','the','in','then','an','a'])
        return self._stop_words

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.stop_words]
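
On Python 3.8 and later, the standard library's functools.cached_property gives the same compute-once-per-instance behaviour without the manual None check. This is an alternative sketch, not what the code above uses:

```python
from functools import cached_property


class ProcessText:

    def __init__(self, flag):
        self.flag = flag

    @cached_property
    def stop_words(self):
        # Runs on first access only; the result is then stored on the instance
        names = {'john', 'sally', 'billy', 'sarah'}
        stops = {'if', 'the', 'in', 'then', 'an', 'a'}
        return names | stops

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.stop_words]


parse = ProcessText(flag=True)
first = parse.stop_words
second = parse.stop_words
print(first is second)  # True, the same set object is reused
print(parse.remove_stopwords(['if', 'john', 'runs']))  # ['runs']
```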

Upvotes: 2

Laxmi Prasad

Reputation: 462

You can assign those names to class variables as

class ProcessText:
   names = ['john','sally','billy','sarah']
   stops = ['if','the','in','then','an','a']

   def __init__(self, flag):
       self.flag = flag # not important for this question

   def remove_stopwords(self, text):
       return [word for word in text if word not in self.names + self.stops]


import pandas as pd
df = pd.read_csv('data.csv')
parse = ProcessText(flag=True)
df['text'] = df['text'].apply(parse.remove_stopwords)

Those class variables are created once, when the class is defined, and are shared by all instances. Assigning them in the __init__() method instead would repeat the assignment each time a new instance is created.
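
That class-level attributes are shared between instances can be checked with an identity comparison. The sketch below also combines the two lists into a single class-level frozenset named STOP_WORDS (a name made up for this example) so the concatenation is not repeated for every word:

```python
class ProcessText:
    names = ['john', 'sally', 'billy', 'sarah']
    stops = ['if', 'the', 'in', 'then', 'an', 'a']
    # Hypothetical addition: built once, when the class body is executed
    STOP_WORDS = frozenset(names + stops)

    def __init__(self, flag):
        self.flag = flag

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.STOP_WORDS]


a = ProcessText(flag=True)
b = ProcessText(flag=False)
print(a.names is b.names)            # True
print(a.STOP_WORDS is b.STOP_WORDS)  # True
print(a.remove_stopwords(['then', 'sally', 'left']))  # ['left']
```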

Upvotes: 3

emmet02

Reputation: 942

You could cache the word lists by setting them in __init__, so that they are only created once per instance. Then, instead of a define_stop_words() method, you would expose them through a property.

class ProcessText:

    def __init__(self, flag):
        self.flag = flag # not important for this question
        self._names = ['john','sally','billy','sarah']
        self._stops = ['if','the','in','then','an','a']

    @property
    def define_stop_words(self):
        return self._stops + self._names

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.define_stop_words]

Note that in Python there is no real concept of a private variable (which is what I think you want to use here: you don't want the user to be able to overwrite these lists once created?). The leading underscore is only a convention, so a user of your code could still update the _names and _stops attributes of the ProcessText object after initialisation, and you would get unexpected results.
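
A short illustration of that point, using the class from this answer: reassigning the underscore-prefixed attribute from outside is allowed and silently changes the result.

```python
class ProcessText:

    def __init__(self, flag):
        self.flag = flag
        self._names = ['john', 'sally', 'billy', 'sarah']
        self._stops = ['if', 'the', 'in', 'then', 'an', 'a']

    @property
    def define_stop_words(self):
        return self._stops + self._names

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.define_stop_words]


parse = ProcessText(flag=True)
before = parse.remove_stopwords(['john', 'ran'])
parse._names = []  # nothing prevents this assignment
after = parse.remove_stopwords(['john', 'ran'])
print(before)  # ['ran']
print(after)   # ['john', 'ran']
```

The property itself has no setter, so assigning to parse.define_stop_words raises AttributeError; only the underlying attributes remain writable.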

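Another thing to consider is to use a set instead of a list (especially if performance is an issue): a membership test on a set is a hash lookup, which takes roughly constant time, whereas on a list it scans the elements one by one.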

Of course, if you were nitpicking further, it would be faster again to combine the lists once and cache the combined set, instead of performing the 'add' on each call to the property, so that the property simply returns the cached set.

For example:

class ProcessText:

    def __init__(self, flag):
        self.flag = flag # not important for this question
        _names = {'john','sally','billy','sarah'}
        _stops = {'if','the','in','then','an','a'}
        self._stops_and_names = _names.union(_stops)

    @property
    def define_stop_words(self):
        return self._stops_and_names

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.define_stop_words]
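
To see the difference between list and set membership on your own machine, a small timeit comparison can be used. The word lists are the ones from this thread; the probe word and the repetition count are arbitrary choices for this sketch, and the timings will vary by machine:

```python
import timeit

names = ['john', 'sally', 'billy', 'sarah']
stops = ['if', 'the', 'in', 'then', 'an', 'a']
stop_list = stops + names
stop_set = set(stop_list)

word = 'zebra'  # not a stop word, so the list has to be scanned to the end

# Time 100,000 membership tests against each container
list_time = timeit.timeit(lambda: word in stop_list, number=100_000)
set_time = timeit.timeit(lambda: word in stop_set, number=100_000)

print(f'list: {list_time:.4f}s  set: {set_time:.4f}s')
```

With only ten entries the gap is likely to be small; it should grow as the stop-word list gets longer.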

Upvotes: 2
