Reputation: 3564
The problem I'm trying to solve is to find the support of each itemset in transactional data.
For example,
transactions = [
    'b c d',
    'a g',
    'a c d e',
    'e f h',
    'a b c g h',
    'd',
    'a e g h',
    'b c d',
    'a b f g h',
    'a c d g',
]
will have supports [2, 5, 1, 1, 1, 5, 1, 2, 1, 1].
So, basically, for the second transaction 'a g': it is a subset of the transactions 'a g', 'a b c g h', 'a e g h', 'a b f g h', and 'a c d g', and hence the count is 5.
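As a quick check of that definition with Python sets:

sum({'a', 'g'} <= set(t.split()) for t in transactions)  # 5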
Now, initially, I converted this dataset to a kind of one-hot encoded representation using mlxtend's TransactionEncoder, and used something like
df.progress_apply(lambda x: (df.iloc[:, np.where(x==1)[0]].sum(1)==len(np.where(x==1)[0])).sum(), axis=1)
to get the values.
The idea is to slice the matrix/df by the elements of the current row and then sum across rows; wherever that sum equals the number of elements of the current row, the current row is a subset of that row, and hence it is counted.
However, this worked fine for smaller datasets, but when I came to the kosarak dataset I could not keep a dense representation because of an OOM error. So I switched to CountVectorizer, generated a sparse representation, and then used a similar logic as before.
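Roughly, the sparse version looks something like this (a sketch of the idea rather than the exact code; the CountVectorizer settings shown are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Sparse one-hot matrix: rows are transactions, columns are items
X = CountVectorizer(binary=True, analyzer=str.split).fit_transform(transactions)
row_sizes = np.asarray(X.sum(axis=1)).ravel()  # number of items in each transaction
# For each row, count the rows whose overlap with it covers all of its items
counts = [
    int((np.asarray(X[:, X[i].indices].sum(axis=1)).ravel() == row_sizes[i]).sum())
    for i in range(X.shape[0])
]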
Now the issue is that the sum over the scipy sparse matrix is about 4x slower than over the dense one, with a run time of
164 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Even using sets to solve the problem didn't improve things much.
So far, this has been my approach, and I believe it has O(n^2) complexity. Is there any better algorithm/package to speed things up?
Any help is appreciated. Thanks in advance.
Upvotes: 6
Views: 815
Reputation: 114230
You can use numba to speed up your computation over a dense array. Start with the one-hot encoding.
import numpy as np

transactions = list(map(str.split, transactions))

# First compute indices for each element of transactions
indices = dict(item[::-1] for item in enumerate(set(x for t in transactions for x in t)))

# Make the matrix len(transactions)-by-len(indices)
A = np.zeros((len(transactions), len(indices)), bool)

# Fill in the matrix
for r, item in enumerate(transactions):
    A[r, [indices[t] for t in item]] = True
Now you can dot each row of A against all of A, threshold the result against the row's size, and sum the entries that pass:
import numba

@numba.njit
def check_subsets(A):
    output = np.empty(len(A), dtype=np.int_)
    for r, row in enumerate(A):
        output[r] = np.sum((A * row).sum(1) >= row.sum())
    return output
>>> check_subsets(A)
array([2, 5, 1, 1, 1, 5, 1, 2, 1, 1])
This does make things faster for your original dataset:
%timeit check_subsets(A)
5.2 µs ± 343 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit (np.matmul(A, A.T, dtype=int) >= A.sum(1, keepdims=True)).sum(1)
7.86 µs ± 39 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The improvement does not hold for large datasets, but the memory footprint is much smaller.
np.random.seed(42)
B = np.random.randint(2, size=(10000, 26), dtype=bool)
%timeit check_subsets(B)
4.48 s ± 62.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit (np.matmul(B, B.T, dtype=int) >= B.sum(1, keepdims=True)).sum(1)
2.31 s ± 96.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Both benchmarks were run after the initial compilation phase of numba's JIT compiler. Note that the naive implementation without buffers is approximately 5% slower than the one shown.
Upvotes: 1
Reputation: 1677
Where possible, Python's set arithmetic is usually pretty decent and does not involve any convoluted binarizing logic, which is arguably harder to read/understand.
Just a suggestion to build upon:
transactions = [
    'b c d',
    'a g',
    'a c d e',
    'e f h',
    'a b c g h',
    'd',
    'a e g h',
    'b c d',
    'a b f g h',
    'a c d g',
]

transactions = list(map(lambda x: x.replace(' ', ''), transactions))
print(transactions)  # ['bcd', 'ag', 'acde', 'efh', 'abcgh', 'd', 'aegh', 'bcd', 'abfgh', 'acdg']

transactions_set = list(map(set, transactions))
counts = [sum(set(elem).issubset(s) for s in transactions_set) for elem in transactions]
print(counts)  # [2, 5, 1, 1, 1, 5, 1, 2, 1, 1]
Upvotes: 0
Reputation: 2604
My small attempt.
If your current approach takes ~164 ms per loop, this one brings roughly an 8x improvement. Unfortunately I can't lay claim to anything genius, and I'm afraid it's still too slow. I just pre-built all the sets and then ran the check in the most straightforward manner with issubset, as @solid.py does. Building the sets beforehand and simply using a for-loop instead of a function call made about a 6x difference.
The current timing of one set check is ~22 ms ± 2 ms or so. I've been testing directly on the kosarak dataset, so I hope there is only one dataset with such a name.
I have tried a few "smarter" ways to kill off the implausible options; unfortunately, all of them ended up slower than this "stupid", direct one.
A few of the ideas that might actually be useful:
- Sort the sets by size, then compute the matches only against those of greater or equal length (the length check is the first thing .issubset does anyway). Since the first ~30,000 sets contain only a single item and another ~35,000 consist of two items, this could mean removing ~30% of the computation, maybe more, since the small sets could be cached for further improvement.
- This leads to caching of the results, at least for the short ones. Creating a 1:{2:{}} structure is rather cheap and it allows you to reuse the result. Using it even on the unsorted values resulted in a ~1.5 ms improvement. It is not much, but it could be even more with the sorting. It is also possible to cut off this caching when the sets get bigger (and thus the probability of having the result cached grows smaller).
- Generally, there are several transactions that repeat several hundred if not a thousand times. This would help cut them down, further lessening the n in O(n^2). Unfortunately, I don't have anything to lower the complexity itself.
- Expanding on the caching: sorting and counting the sets beforehand could also be used to replace each set with a tuple (set, count). This would remove the need for the caching above altogether and would remove most of the unnecessary computation (a rough sketch follows the code below).
import csv
import time

reader = csv.reader(open('kosarak.csv'), delimiter=' ')
dataLines = []
for line in reader:
    dataLines.append(set(map(int, line)))

results = []
count = 0
totalTime = 0
for line1 in dataLines:
    r1 = 0
    t1 = time.time_ns()
    for line2 in dataLines:
        if line1.issubset(line2):
            r1 += 1
    t2 = time.time_ns()
    results.append(r1)
    totalTime += (t2 - t1) / 1000000
    count += 1
    if (count % 100) == 0:
        print("$$$$$$$$$$$$$")
        print(totalTime)
        print(totalTime / count)
        print(count)
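A rough sketch of what the combined size-sorting and (set, count) deduplication ideas could look like (untested on kosarak, just to illustrate):

from collections import Counter

def supports(transaction_sets):
    # Deduplicate: count how many times each distinct itemset occurs
    counted = Counter(frozenset(s) for s in transaction_sets)
    # Sort the unique itemsets by size, ascending
    unique_sorted = sorted(counted, key=len)
    support = {}
    for i, small in enumerate(unique_sorted):
        total = 0
        # Only itemsets of equal or greater size can be supersets, and two
        # distinct itemsets of equal size never are, so later entries suffice
        for big in unique_sorted[i:]:
            if small <= big:
                total += counted[big]
        support[small] = total
    return [support[frozenset(s)] for s in transaction_sets]

dedup_results = supports(dataLines)  # should match the counts from the nested loop above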
Upvotes: 0
Reputation: 14399
Since 2**26 is well below the integer limit on 32-bit integers, you can do this:
import string
import numpy as np

digitize = lambda x: np.in1d(list(string.ascii_lowercase), x.split()) @ 2 ** np.arange(26)
digitize converts the strings of letters into a unique bitwise integer for each set of letters. Since the data is bitwise, it can be compared with bit arithmetic.
trans = np.array([digitize(t) for t in transactions])
trans
Out[]: array([ 14,  65,  29, 176, 199,   8, 209,  14, 227,  77], dtype=int32)
(np.bitwise_and.outer(trans, trans) == trans).sum(0)  # bitwise definition of subset, summed over entries
Out[]: array([2, 5, 1, 1, 1, 5, 1, 2, 1, 1])
You could easily make a column of trans and then apply the bitwise function to get your desired output. It should also reduce memory usage by not storing those big one-hot arrays.
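For instance, a minimal sketch of that, assuming the raw strings live in a pandas DataFrame column named 'transaction' (the column name is hypothetical):

import pandas as pd

df = pd.DataFrame({'transaction': transactions})
df['trans'] = df['transaction'].map(digitize)  # one bitmask integer per row
tr = df['trans'].to_numpy()
# row j is counted once for every row i whose bitmask contains it
df['support'] = (np.bitwise_and.outer(tr, tr) == tr).sum(0)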
Upvotes: 1