Reputation: 570
I have several functions that work on strings. While they take different kinds of arguments, they all take one common argument called tokenizer_func (defaulting to str.split), which splits the input string into a list of tokens according to the provided function. The list of strings that is returned is then modified in each function. Since tokenizer_func seems to be a common argument and applying it is the very first line of code in all the functions, I was wondering if it would be easier to use a decorator to decorate the string modification functions. Basically, the decorator would take the tokenizer_func, apply it to the incoming string, and call the appropriate string modification function.
Edit-2
I was able to find a solution (maybe hacky?):
import random
import string

def tokenize(f):
    def _split(text, tokenizer=SingleSpaceTokenizer()):
        return tokenizer.decode(f(tokenizer.encode(text)))
    return _split

@tokenize
def change_first_letter(token_list, *_):
    # Replace the first letter of every token with a random ASCII letter
    return [random.choice(string.ascii_letters) + token[1:] for token in token_list]
This way I can call change_first_letter(text) to use the default tokenizer and change_first_letter(text, new_tokenizer) to use the new_tokenizer. If there is a better way, please let me know.
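For example, both call styles work (a quick sanity check, using the AtTokenizer defined in Edit-1 below):

text = 'hello world'
change_first_letter(text)                  # default SingleSpaceTokenizer, e.g. 'Qello Rorld'
change_first_letter(text, AtTokenizer())   # splits on '@' instead, e.g. 'Xello world'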
Edit-1:
After viewing the first reply to this question, I thought I could generalize the problem a bit to handle more involved tokenizers. Specifically, I now have this:
from abc import ABC, abstractmethod
from typing import Any, List

class Tokenizer(ABC):
    """
    Base class for Tokenizer which provides the encode and decode methods
    """
    def __init__(self, tokenizer: Any) -> None:
        self.tokenizer = tokenizer

    @abstractmethod
    def encode(self, text: str) -> List[str]:
        """
        Tokenize a string into a list of strings
        :param text: Text to be tokenized
        :return: List of tokens
        """

    @abstractmethod
    def decode(self, token_list: List[str]) -> str:
        """
        Creates a string from a token list using the tokenizer
        :param token_list: List of tokens
        :return: Reconstructed string from token list
        """

    def encode_many(self, texts: List[str]) -> List[List[str]]:
        """
        Encode multiple strings
        :param texts: List of strings to be tokenized
        :return: List of tokenized strings
        """
        return [self.encode(text) for text in texts]

    def decode_many(self, token_lists: List[List[str]]) -> List[str]:
        """
        Decode multiple strings
        :param token_lists: List of tokenized strings
        :return: List of reconstructed strings
        """
        return [self.decode(token_list) for token_list in token_lists]
class SingleSpaceTokenizer(Tokenizer):
    """
    Simple tokenizer that splits a string on whitespace using str.split
    and re-joins tokens with single spaces
    """
    def __init__(self, tokenizer=None) -> None:
        super(SingleSpaceTokenizer, self).__init__(tokenizer)

    def encode(self, text: str) -> List[str]:
        return text.split()

    def decode(self, token_list: List[str]) -> str:
        return ' '.join(token_list)
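For example, encoding and then decoding round-trips single-spaced text:

tokenizer = SingleSpaceTokenizer()
tokens = tokenizer.encode('the quick brown fox')   # ['the', 'quick', 'brown', 'fox']
tokenizer.decode(tokens)                           # 'the quick brown fox'
tokenizer.encode_many(['a b', 'c d'])              # [['a', 'b'], ['c', 'd']]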
I've written a decorator function based on a reply and some searching:
def tokenize(tokenizer):
    def _tokenize(f):
        def _split(text):
            response = tokenizer.decode(f(tokenizer.encode(text)))
            return response
        return _split
    return _tokenize
Now I am able to do this:
@tokenize(SingleSpaceTokenizer())
def change_first_letter(token_list):
    return [random.choice(string.ascii_letters) + token[1:] for token in token_list]
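Calling the decorated function on a raw string now tokenizes, transforms, and re-joins in one step (assuming random and string are imported as above):

change_first_letter('hello world')   # e.g. 'Tello Aorld'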
This works without any problems. Now let's say I, as a user, want to use another tokenizer:
class AtTokenizer(Tokenizer):
    def __init__(self, tokenizer=None):
        super(AtTokenizer, self).__init__(tokenizer)

    def encode(self, text):
        return text.split('@')

    def decode(self, token_list):
        return '@'.join(token_list)

new_tokenizer = AtTokenizer()
How would I invoke my text functions by passing this new_tokenizer?
I found out that I can call this new_tokenizer like this: tokenize(new_tokenizer)(change_first_letter)(text), if I DO NOT decorate the change_first_letter function. This seems very tedious, though. Is there a way to do this more concisely?
Original:
Here is an example of two such functions (the first one is a dummy function):
import random
import string
from typing import Callable, List
from spellchecker import SpellChecker  # pyspellchecker package

def change_first_letter(text: str, tokenizer_func: Callable[[str], List[str]] = str.split) -> str:
    words = tokenizer_func(text)
    return ' '.join([random.choice(string.ascii_letters) + word[1:] for word in words])

def spellcheck(text: str, tokenizer_func: Callable[[str], List[str]] = str.split) -> str:
    words = tokenizer_func(text)
    return ' '.join([SpellChecker().correction(word) for word in words])
As you can see, for both functions the first line applies the tokenizer function. If the tokenizer function is always str.split, then I could create a decorator that would do this for me:
def tokenize(func):
    def _split(text):
        return func(text.split())
    return _split
Then I could just decorate the other functions with @tokenize and it would work. In this case, the functions would directly take List[str]. However, the tokenizer_func is provided by the function caller. How would I pass this to the decorator? Can this be done?
Upvotes: 1
Views: 478
Reputation: 10452
def tokenize(tokenizer):
    def _tokenize(f):
        def _split(text, tokenizer=tokenizer):
            response = tokenizer.decode(f(tokenizer.encode(text)))
            return response
        return _split
    return _tokenize
That way you can call your change_first_letter in two ways:
- change_first_letter(text) to use the default tokenizer
- change_first_letter(text, new_tokenizer) to use new_tokenizer
MyPy doesn't like it when decorators change which parameters a function accepts, so if you're using MyPy you might want to write a plugin for it.
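For example (a short sketch reusing the tokenizer classes from the question):

@tokenize(SingleSpaceTokenizer())
def change_first_letter(token_list):
    return [random.choice(string.ascii_letters) + token[1:] for token in token_list]

change_first_letter('some sample text')                 # default SingleSpaceTokenizer
change_first_letter('some@sample@text', AtTokenizer())  # override with AtTokenizer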
Upvotes: 1
Reputation: 24691
The @ syntax of a decorator simply evaluates the rest of the line as a function, calls that function on the function that's defined immediately afterwards, and replaces it as such. By making the 'decorator with arguments' (tokenize()) return a regular decorator, that decorator will then encompass the original function.
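In other words, the decorated definition is just shorthand for wrapping the function by hand:

@tokenize(method=str.split)
def strfunc(text):
    print(text)

# ...is equivalent to:
def strfunc(text):
    print(text)

strfunc = tokenize(method=str.split)(strfunc)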
def tokenize(method):
    def decorator(function):
        def wrapper(text):
            return function(method(text))
        return wrapper
    return decorator

@tokenize(method=str.split)
def strfunc(text):
    print(text)

strfunc('The quick brown fox jumped over the lazy dog')
# ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
The problem with this is that, if you were to assign a default argument (e.g. def tokenize(method=str.split):), you'd still need to call it as a function when applying the decorator:
@tokenize()
def strfunc(text):
    ...
so it might be best to not give a default argument, or to find a creative way around this problem. One possible solution would be to change the decorator's behavior depending on whether it's called with a function (in which case it decorates that function) or a string (in which case it calls str.split()):
def tokenize(method):
    def decorator(arg):
        # if argument is a function, then apply another decorator
        # otherwise, assume str.split()
        if type(arg) == type(tokenize):
            def wrapper(text):
                return arg(method(text))
            return wrapper
        else:
            return method(str.split(arg))
    return decorator
which should allow both of the following:

@tokenize  # default to str.split
def strfunc(text):
    ...

@tokenize(str.split)  # or another function of your choice
def strfunc(text):
    ...
The downside to this is that it's a bit hacky (playing with type() always is; the saving grace here is that all functions are functions, though you could instead check with callable() if you wanted it to apply to classes as well), and it makes it hard to figure out which parameters are doing what inside tokenize(), since they change purposes depending on how the method is called.
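For instance, the callable() variant mentioned above might look like this (a sketch keeping the same fallback behavior):

def tokenize(method):
    def decorator(arg):
        if callable(arg):
            # decorating a function: tokenize the text before passing it on
            def wrapper(text):
                return arg(method(text))
            return wrapper
        # bare @tokenize case: arg is the text itself, default to str.split
        return method(str.split(arg))
    return decorator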
Upvotes: 0