Reputation: 83
Update: It's been over 24 hours and the code is still not done yet :)
I have the Python code below. It's currently running on only 1% of the dataset (that's why the frame is called sample): 32,968 rows of just names. I've already stripped the punctuation and lowercased everything.
My problem is that this code has been running for 8 hours so far and isn't done yet. Since, as mentioned, I'm only using 1% of the data, I will need to run this code again later on the entire dataset, which would take at least 100x the time (and since the comparisons are pairwise, the true blowup is closer to quadratic). I don't think waiting 800+ hours is a good idea. So for my question:
Are there any ways I can make it faster? Should I learn about Spark or MapReduce and try to use those for this code?
EDIT: Okay, I will try to add more information on what the code is actually doing. An example of the names before they are cleaned:
import pandas as pd
import numpy as np
data = {'clean_name': ['Abbott Laboratories','Apple Computers', 'Apple, Inc.', 'Abercrombie & Fitch Co.', 'ABM Industries Incorporated', 'Ace Hardware Corporation'], 'name_group': np.zeros(6, dtype=int)}
sample = pd.DataFrame(data)
sample
Out[2]:
clean_name name_group
0 Abbott Laboratories 0
1 Apple Computers 0
2 Apple, Inc. 0
3 Abercrombie & Fitch Co. 0
4 ABM Industries Incorporated 0
5 Ace Hardware Corporation 0
Then I clean out the punctuation and put it all in lowercase. Basically, I want to compare each name with every other name, and if two names are similar, give them the same group number. Something like this:
sample
Out[28]:
clean_name name_group
0 abbott laboratories 0
1 apple computers 1
2 apple inc 1
3 abercrombie fitch co 0
4 abm industries incorporated 0
5 ace hardware corporation 0
The code below is what I came up with:
import itertools
from fuzzywuzzy import fuzz  # provides token_sort_ratio

i = 1
for alpha, beta in itertools.combinations(sample.clean_name, 2):
    score = fuzz.token_sort_ratio(alpha, beta)
    # Look up the current group numbers of both names
    A = sample.loc[sample.clean_name == alpha, 'name_group'].values[0]
    B = sample.loc[sample.clean_name == beta, 'name_group'].values[0]
    if score > 60:
        if (B != 0) & (A != 0):
            continue  # both names already grouped
        if (A == 0) & (B != 0):
            A = B
        elif (B == 0) & (A != 0):
            B = A
        elif (B == 0) & (A == 0):
            # Neither name has a group yet: start a new one
            A, B = i, i
            i += 1
        sample.loc[sample.clean_name == alpha, 'name_group'] = A
        sample.loc[sample.clean_name == beta, 'name_group'] = B
Upvotes: 3
Views: 4173
Reputation: 44545
Using itertools.combinations over 32k rows will certainly make your code slow. Here is an approach that uses numpy instead of pandas, demonstrated on a smaller dataset. Use this post as a means to attack your problem from a different angle.
Given
Here we build a small list of company names: A, B, C, and Aa:
import itertools as it
import collections as ct
import numpy as np
companies = "A B C Aa".split()
Code
Step 1
First we create a 2D array whose horizontal and vertical indices are the same company names. The cells of the matrix hold the merged (concatenated) company names:
# 1. Build a 2D array of joined strings
def get_2darray(seq):
    """Return a 2D array of joined strings along identical axes."""
    x = np.array(seq)
    y = np.array(seq)
    xx = x[:, np.newaxis]                     # column vector
    yy = y[np.newaxis, :]                     # row vector
    return np.core.defchararray.add(xx, yy)   # ref 001
Demo
arr = get_2darray(companies)
arr
# array([['AA', 'AB', 'AC', 'AAa'],
# ['BA', 'BB', 'BC', 'BAa'],
# ['CA', 'CB', 'CC', 'CAa'],
# ['AaA', 'AaB', 'AaC', 'AaAa']],
# dtype='<U4')
Step 2
Second, we implement a group function for enumerating similar companies. Given a 2D array, a helper function (func) is used to "transform" each element into a group number:
# 2. Group companies by "similarity", e.g. "AB" == "BA"
def group(func, arr, pred=None, verbose=False):
    """Return an array of items enumerated by similarity."""
    if pred is None:
        # Default predicate: flag strings with repeated characters,
        # which sets the diagonal (e.g. "AA") to zero
        pred = lambda x: len(set(x)) != len(x)
    dd = ct.defaultdict(it.count().__next__)
    dd[""] = np.nan
    # opt_func = np.vectorize(func)         # optional, slower
    opt_func = np.frompyfunc(func, 3, 1)    # ref 002
    m = opt_func(arr, dd, pred)
    if verbose:
        print(dd)
    return m

def transform(item, lookup, pred):
    """Return a unique group number element-wise."""
    unique_idx = "".join(sorted(item.lower()))
    name_group = lookup[unique_idx]
    if pred(item):
        return 0
    return name_group
Demo
groups = group(transform, arr, verbose=True)
groups
# defaultdict(<method-wrapper '__next__' of itertools.count object at 0x00000000062BE408>,
# {'': nan, 'aaa': 3, 'aac': 8, 'ab': 1,
# 'cc': 7, 'aa': 0, 'bc': 5, 'aaaa': 9,
# 'ac': 2, 'bb': 4, 'aab': 6})
# array([[0, 1, 2, 0],
# [1, 0, 5, 6],
# [2, 5, 0, 8],
# [0, 6, 8, 0]], dtype=object)
Each combination of company names is assigned a unique group number.
Step 3
You can now look up the group number shared by two companies by slicing the groups array:

# 3. Look up the group number shared by two companies
reversed_lookup = {name: i for i, name in enumerate(companies)}

def group_number(arr, a, b):
    """Return the name_group shared by companies `a` and `b` in 2D array `arr`."""
    i, j = reversed_lookup[a], reversed_lookup[b]
    return arr[i, j]

for pair in [["B", "C"], ["A", "B"], ["C", "C"]]:
    msg = "Companies {}: group {}"
    print(msg.format(" & ".join(pair), group_number(groups, *pair)))
# Companies B & C: group 5
# Companies A & B: group 1
# Companies C & C: group 0
Details
Step 1
Why use arrays? A numpy array allows fast lookups, just like pandas. In addition, we can later optimize performance using operations implemented at the C level (these are fast).
Why merge company names in the array? The 2D array of merged strings is used for comparing company names; this style of comparison resembles a statistical correlation matrix.
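To illustrate the broadcasting that get_2darray relies on, here is a minimal standalone sketch (the two-name list is just for demonstration):
x = np.array(["A", "B"])
# Adding a column vector to a row vector broadcasts into
# every pairwise concatenation.
np.core.defchararray.add(x[:, np.newaxis], x[np.newaxis, :])
# array([['AA', 'AB'],
#        ['BA', 'BB']], dtype='<U2')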
Step 2
How are groups determined? The company names are passed to a special dictionary (dd) that assigns the next incremented integer only when a new key is encountered. This dictionary is used to track groups as the transform helper function is applied to each element.
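As a minimal standalone sketch of how such a counter-backed dictionary behaves (the fruit keys are placeholders, not part of the answer):
dd = ct.defaultdict(it.count().__next__)
dd["apple"]   # 0: an unseen key receives the next integer
dd["banana"]  # 1
dd["apple"]   # 0: a seen key returns its existing number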
Why use a helper function? The transform function converts each item in the array to a group number. Notice the tracking dictionary (lookup) is passed in along with a predicate. Here are some notes about these group parameters:
When no predicate is given (pred=None), a default predicate is applied, which naively flags strings containing repeated characters (particularly along the diagonal).
You may wish to use another predicate. For example, under the default predicate, any set of lowered strings is equivalent, such that A == Aa == AaAa (see that the corners of the array are assigned to group 0). Here is another sample predicate that distinguishes A from Aa (groups 0 and 3 respectively):
pred = lambda x: all(not(v%2) for k, v in ct.Counter(x).items())
group(transform, arr, pred)
# array([[0, 1, 2, 3],
# [1, 0, 5, 6],
# [2, 5, 0, 8],
# [3, 6, 8, 0]], dtype=object)
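This predicate returns True exactly when every character in the joined string occurs an even number of times, which holds for self-pairs such as "AA" and "AaAa". A quick check with the pred defined above:
pred("AA")    # True  -> forced to group 0
pred("AAa")   # False -> receives its own group number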
How is performance optimized? Some operations are vectorized to help speed up the code with C implementations. In the group function, numpy.frompyfunc wraps the helper function. It was determined that this particular "universal function" is faster than the vectorizing function numpy.vectorize. See also the Scipy Lecture Notes for more ways to optimize numpy code.
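If you want to compare the two wrappers yourself, here is a minimal sketch (the shout function and the array contents are arbitrary assumptions for illustration):
def shout(s):
    return s.upper()

words = np.array(["alpha", "beta", "gamma"] * 1000)

ufunc = np.frompyfunc(shout, 1, 1)  # 1 input argument, 1 output
vfunc = np.vectorize(shout)

# %timeit ufunc(words)
# %timeit vfunc(words)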
Step 3
How do you find the group number for two companies? Simply by slicing the array returned from the group function. group_number is a slicing function for querying that array. Since the indices are numeric after Step 2, we build a reversed dictionary from our starting, ordered sequence companies to find the numeric index corresponding to each company name. Note the reversed dictionary is built outside of the slicing function to avoid rebuilding the dictionary on every query.
Performance
How fast is it? For our simple sequence of < 10 rows, the speed is sub-millisecond:
%timeit group(transform, arr)
# 10000 loops, best of 3: 110 µs per loop
For demonstration, let's scale the data up to around 1000 rows (beyond that, even creating the dataset takes a long time and consumes a lot of memory).
test = tuple(map(str, range(1000)))
full_arr = get_2darray(test)
print(full_arr.shape)
full_arr
# (1000, 1000)
# array([['00', '01', '02', ..., '0997', '0998', '0999'],
# ['10', '11', '12', ..., '1997', '1998', '1999'],
# ['20', '21', '22', ..., '2997', '2998', '2999'],
# ...,
# ['9970', '9971', '9972', ..., '997997', '997998', '997999'],
# ['9980', '9981', '9982', ..., '998997', '998998', '998999'],
# ['9990', '9991', '9992', ..., '999997', '999998', '999999']],
# dtype='<U6')
%timeit group(transform, full_arr)
# 1 loop, best of 3: 5.3 s per loop
Save some computation time by evaluating only half of the matrix:
half_arr = np.triu(full_arr)
half_arr
# array([['00', '01', '02', ..., '0997', '0998', '0999'],
# ['', '11', '12', ..., '1997', '1998', '1999'],
# ['', '', '22', ..., '2997', '2998', '2999'],
# ...,
# ['', '', '', ..., '997997', '997998', '997999'],
# ['', '', '', ..., '', '998998', '998999'],
# ['', '', '', ..., '', '', '999999']],
# dtype='<U6')
%timeit group(transform, half_arr)
# 1 loop, best of 3: 3.61 s per loop
Note: profiling was not performed on a dataset of 32k rows.
Conclusions
In this approach, the aforementioned objective was achieved: the problem was attacked from a different angle, using numpy to optimize the comparison functions at the C level. While the performance tests in this post may still take time, numpy offers headroom for further optimization. Moreover, it is probable that this code, as is, will take less than the OP's 8 hours on the full dataset. Further profiling is required to assess the complexity of this approach. If the complexity is reasonable, the user may decide how to proceed, e.g. parallel processing across multiple workers. Such tasks are left to those who may be interested.
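For those who are interested, here is a minimal sketch of what parallelizing the original pairwise scoring could look like with the standard library; the worker count, chunksize, and the score_pair helper are assumptions for illustration, not part of the answer above:
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

from fuzzywuzzy import fuzz

def score_pair(pair):
    """Score one pair of names; runs in a worker process."""
    a, b = pair
    return a, b, fuzz.token_sort_ratio(a, b)

def score_all(names, workers=4):
    """Fan the pairwise comparisons out across several processes."""
    pairs = combinations(names, 2)
    with ProcessPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(score_pair, pairs, chunksize=10000))
You would still need the OP's grouping pass over the returned (a, b, score) triples.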
References
Upvotes: 6