Reputation: 2101
Suppose I have arrays of tuples like so:
a = [('shape', 'rectangle'), ('fill', 'no'), ('size', 'huge')]
b = [('shape', 'rectangle'), ('fill', 'yes'), ('size', 'large')]
I am trying to turn these arrays into numerical vectors with each dimension representing a feature.
So the expected output we be something like:
amod = [1, 0, 1] # or [1, 1, 1]
bmod = [1, 1, 2] # or [1, 2, 2]
So the vector that gets created is dependent on what it has seen before (i.e rectangle is still coded as 1
but the new value 'large' gets coded as a next step up as 2
).
I think I could use some combination of yield
and a memoize function to help me with this. This is what I've tried so far:
def memoize(f):
memo = {}
def helper(x):
if x not in memo:
memo[x] = f(x)
return memo[x]
return helper
@memoize
def verbal_to_value(tup):
u = 1
if tup[0] == 'shape':
yield u
u += 1
if tup[0] == 'fill':
yield u
u += 1
if tup[0] == 'size':
yield u
u += 1
But I keep getting this error:
TypeError: 'NoneType' object is not callable
Is there a way I can create this function that has a memory of what it has seen? Bonus points if it could add keys dynamically so I don't have to hardcode things like 'shape' or 'fill'.
Upvotes: 0
Views: 394
Reputation: 84
First off: this is my preferred implementation of the memoize decorator, mostly because of speed ...
def memoize(f):
class memodict(dict):
__slots__ = ()
def __missing__(self, key):
self[key] = ret = f(key)
return ret
return memodict().__getitem__
except for some a few edge cases it has the same effect as yours:
def memoize(f):
memo = {}
def helper(x):
if x not in memo:
memo[x] = f(x)
#else:
# pass
return memo[x]
return helper
but is somewhat faster because the if x not in memo:
happens in
native code instead of in python. To understand it you merely need
to know that under normal circumstances: to interpret adict[item]
python calls adict.__getitem__(key)
, if adict doesn't contain key,
__getitem__()
calls adict.__missing__(key)
so we can leverage the
python magic methods protocols for our gain...
#This the first idea I had how I would implement your
#verbal_to_value() using memoization:
from collections import defaultdict
work=defaultdict(set)
@memoize
def verbal_to_value(kv):
k, v = kv
aset = work[k] #work creates a new set, if not already created.
aset.add(v) #add value if not already added
return len(aset)
including the memoize decorator, that's 15 lines of code...
#test suite:
def vectorize(alist):
return [verbal_to_value(kv) for kv in alist]
a = [('shape', 'rectangle'), ('fill', 'no'), ('size', 'huge')]
b = [('shape', 'rectangle'), ('fill', 'yes'), ('size', 'large')]
print (vectorize(a)) #shows [1,1,1]
print (vectorize(b)) #shows [1,2,2]
defaultdict is a powerful object that has almost the same logic
as memoize: a standard dictionary in every way, except that when the
lookup fails, it runs the callback function to create the missing
value. In our case set()
Unfortunately this problem requires either access to the tupple that is being used as the key, or to the dictionary state itself. With the result that we cannot just write a simple function for .default_factory
But we can write a new object based on the memoize/defaultdict pattern:
#This how I would implement your verbal_to_value without
#memoization, though the worker class is so similar to @memoize,
#that it's easy to see why memoize is a good pattern to work from:
class sloter(dict):
__slots__ = ()
def __missing__(self,key):
self[key] = ret = len(self) + 1
#this + 1 bothers me, why can't these vectors be 0 based? ;)
return ret
from collections import defaultdict
work2 = defaultdict(sloter)
def verbal_to_value2(kv):
k, v = kv
return work2[k][v]
#~10 lines of code?
#test suite2:
def vectorize2(alist):
return [verbal_to_value2(kv) for kv in alist]
print (vectorize2(a)) #shows [1,1,1]
print (vectorize2(b)) #shows [1,2,2]
You might have seen something like sloter
before, because it's
sometimes used for exactly this sort of situation. Converting member
names to numbers and back. Because of this, we have the advantage of
being able to reverse things like this:
def unvectorize2(a_vector, pattern=('shape','fill','size')):
reverser = [{v:k2 for k2,v in work2[k].items()} for k in pattern]
for index, vect in enumerate(a_vector):
yield pattern[index], reverser[index][vect]
print (list(unvectorize2(vectorize2(a))))
print (list(unvectorize2(vectorize2(b))))
But I saw those yields in your original post, and they've got me
thinking... what if there was a memoize / defaultdict like object
that could take a generator instead of a function and knew to just
advance the generator rather than calling it. Then I realized ...
that yes generators come with a callable called __next__()
which
meant that we didn't need a new defaultdict implementation, just a
careful extraction of the correct member funtion...
def count(start=0): #same as: from itertools import count
while True:
yield start
start += 1
#so we could get the exact same behavior as above, (except faster)
#by saying:
sloter3=lambda :defaultdict(count(1).__next__)
#and then
work3 = defaultdict(sloter3)
#or just:
work3 = defaultdict(lambda :defaultdict(count(1).__next__))
#which yes, is a bit of a mindwarp if you've never needed to do that
#before.
#the outer defaultdict interprets the first item. Every time a new
#first item is received, the lambda is called, which creates a new
#count() generator (starting from 1), and passes it's .__next__ method
#to a new inner defaultdict.
def verbal_to_value3(kv):
k, v = kv
return work3[k][v]
#you *could* call that 8 lines of code, but we managed to use
#defaultdict twice, and didn't need to define it, so I wouldn't call
#it 'less complex' or anything.
#test suite3:
def vectorize3(alist):
return [verbal_to_value3(kv) for kv in alist]
print (vectorize3(a)) #shows [1,1,1]
print (vectorize3(b)) #shows [1,2,2]
#so yes, that can also work.
#and since the internal state in `work3` is stored in the exact same
#format, it be accessed the same way as `work2` to reconstruct input
#from output.
def unvectorize3(a_vector, pattern=('shape','fill','size')):
reverser = [{v:k2 for k2,v in work3[k].items()} for k in pattern]
for index, vect in enumerate(a_vector):
yield pattern[index], reverser[index][vect]
print (list(unvectorize3(vectorize3(a))))
print (list(unvectorize3(vectorize3(b))))
Final comments:
Each of these implementations suffer from storing state in a global variable. Which I find anti-aesthetic but depending on what you're planning to do with that vector later, that might be a feature. As I demonstrated.
Edit: Another day of meditating on this, and the sorts of situations where I might need it, I think that I'd encapsulate this feature like this:
from collections import defaultdict
from itertools import count
class slotter4:
def __init__(self):
#keep track what order we expect to see keys
self.pattern = defaultdict(count(1).__next__)
#keep track of what values we've seen and what number we've assigned to mean them.
self.work = defaultdict(lambda :defaultdict(count(1).__next__))
def slot(self, kv, i=False):
"""used to be named verbal_to_value"""
k, v = kv
if i and i != self.pattern[k]:# keep track of order we saw initial keys
raise ValueError("Input fields out of order")
#in theory we could ignore this error, and just know
#that we're going to default to the field order we saw
#first. Or we could just not keep track, which might be
#required, if our code runs to slow, but then we cannot
#make pattern optional in .unvectorize()
return self.work[k][v]
def vectorize(self, alist):
return [self.slot(kv, i) for i, kv in enumerate(alist,1)]
#if we're not keeping track of field pattern, we could do this instead
#return [self.work[k][v] for k, v in alist]
def unvectorize(self, a_vector, pattern=None):
if pattern is None:
pattern = [k for k,v in sorted(self.pattern.items(), key=lambda a:a[1])]
reverser = [{v:k2 for k2,v in work3[k].items()} for k in pattern]
return [(pattern[index], reverser[index][vect])
for index, vect in enumerate(a_vector)]
#test suite4:
s = slotter4()
if __name__=='__main__':
Av = s.vectorize(a)
Bv = s.vectorize(b)
print (Av) #shows [1,1,1]
print (Bv) #shows [1,2,2]
print (s.unvectorize(Av))#shows a
print (s.unvectorize(Bv))#shows b
else:
#run the test silently, and only complain if something has broken
assert s.unvectorize(s.vectorize(a))==a
assert s.unvectorize(s.vectorize(b))==b
Good luck out there!
Upvotes: 2
Reputation: 815
Not the best approach, but may help you to figure out a better solution
class Shape:
counter = {}
def to_tuple(self, tuples):
self.tuples = tuples
self._add()
l = []
for i,v in self.tuples:
l.append(self.counter[i][v])
return l
def _add(self):
for i,v in self.tuples:
if i in self.counter.keys():
if v not in self.counter[i]:
self.counter[i][v] = max(self.counter[i].values()) +1
else:
self.counter[i] = {v: 0}
a = [('shape', 'rectangle'), ('fill', 'no'), ('size', 'huge')]
b = [('shape', 'rectangle'), ('fill', 'yes'), ('size', 'large')]
s = Shape()
s.to_tuple(a)
s.to_tuple(b)
Upvotes: 1