Reputation: 83
Update: It's been over 24 hours and the code is still not done yet :)
I have the Python code below. It's currently running on only 1% of the dataset (that's why the frame is called sample): 32,968 rows of just names. I've already stripped the punctuation and lowercased everything.
My problem is that this code has been running for 8 hours so far and isn't done yet. Since, as mentioned, I'm only using 1% of the data, I will need to run this code again later on the entire dataset, which would take at least 100x the time (and since the comparisons are pairwise, the true blowup is closer to quadratic). I don't think waiting 800+ hours is a good idea. So for my question:
Are there any ways I can make it faster? Should I learn about Spark or MapReduce and try to use those for this code?
EDIT: Okay, I will try to add more information on what the code is actually doing. An example of the names before they are cleaned:
import pandas as pd
import numpy as np
data = {'clean_name': ['Abbott Laboratories','Apple Computers', 'Apple, Inc.', 'Abercrombie & Fitch Co.', 'ABM Industries Incorporated', 'Ace Hardware Corporation'], 'name_group': np.zeros(6, dtype=int)}
sample = pd.DataFrame(data)
sample
Out[2]:
clean_name name_group
0 Abbott Laboratories 0
1 Apple Computers 0
2 Apple, Inc. 0
3 Abercrombie & Fitch Co. 0
4 ABM Industries Incorporated 0
5 Ace Hardware Corporation 0
Then I clean out the punctuation and put it all in lowercase. Basically, I want to compare each name with every other name, and if two names are similar, give them the same group number. Something like this:
sample
Out[28]:
clean_name name_group
0 abbott laboratories 0
1 apple computers 1
2 apple inc 1
3 abercrombie fitch co 0
4 abm industries incorporated 0
5 ace hardware corporation 0
The code below is what I came up with:
import itertools
from fuzzywuzzy import fuzz  # provides token_sort_ratio

i = 1
for alpha, beta in itertools.combinations(sample.clean_name, 2):
    score = fuzz.token_sort_ratio(alpha, beta)
    # Look up the current group numbers of both names
    A = sample.loc[sample.clean_name == alpha, 'name_group'].values[0]
    B = sample.loc[sample.clean_name == beta, 'name_group'].values[0]
    if score > 60:
        if (B != 0) & (A != 0):
            continue  # both names already grouped
        if (A == 0) & (B != 0):
            A = B
        elif (B == 0) & (A != 0):
            B = A
        elif (B == 0) & (A == 0):
            # Neither name has a group yet: start a new one
            A, B = i, i
            i += 1
        sample.loc[sample.clean_name == alpha, 'name_group'] = A
        sample.loc[sample.clean_name == beta, 'name_group'] = B
Upvotes: 3
Views: 4173
Reputation: 44545
Using itertools.combinations over 32k rows will certainly make your code slow. Here is an approach that uses numpy instead of pandas, demonstrated on a smaller dataset. Use this post as a means to attack your problem from a different angle.
Given
Here we build a small list of company names: A, B, C, and Aa:
import itertools as it
import collections as ct
import numpy as np
companies = "A B C Aa".split()
Code
Step 1
First we create a 2D array whose horizontal and vertical indices are the same company names. The cells of the matrix hold the merged (concatenated) company names:
# 1. Build a 2D array of joined strings
def get_2darray(seq):
    """Return a 2D array of joined strings along identical axes."""
    x = np.array(seq)
    y = np.array(seq)
    xx = x[:, np.newaxis]                     # column vector
    yy = y[np.newaxis, :]                     # row vector
    return np.core.defchararray.add(xx, yy)   # ref 001
Demo
arr = get_2darray(companies)
arr
# array([['AA', 'AB', 'AC', 'AAa'],
# ['BA', 'BB', 'BC', 'BAa'],
# ['CA', 'CB', 'CC', 'CAa'],
# ['AaA', 'AaB', 'AaC', 'AaAa']],
# dtype='<U4')
Step 2
Second, we implement a group function for enumerating similar companies. Given a 2D array, a helper function (func) is used to "transform" each element into a group number:
# 2. Group companies by "similarity", e.g. "AB" == "BA"
def group(func, arr, pred=None, verbose=False):
    """Return an array of items enumerated by similarity."""
    if pred is None:
        # Default predicate: flag strings with repeated characters,
        # which sets the diagonal (e.g. "AA") to zero
        pred = lambda x: len(set(x)) != len(x)
    dd = ct.defaultdict(it.count().__next__)
    dd[""] = np.nan
    # opt_func = np.vectorize(func)         # optional, slower
    opt_func = np.frompyfunc(func, 3, 1)    # ref 002
    m = opt_func(arr, dd, pred)
    if verbose:
        print(dd)
    return m

def transform(item, lookup, pred):
    """Return a unique group number element-wise."""
    unique_idx = "".join(sorted(item.lower()))
    name_group = lookup[unique_idx]
    if pred(item):
        return 0
    return name_group
Demo
groups = group(transform, arr, verbose=True)
groups
# defaultdict(<method-wrapper '__next__' of itertools.count object at 0x00000000062BE408>,
# {'': nan, 'aaa': 3, 'aac': 8, 'ab': 1,
# 'cc': 7, 'aa': 0, 'bc': 5, 'aaaa': 9,
# 'ac': 2, 'bb': 4, 'aab': 6})
# array([[0, 1, 2, 0],
# [1, 0, 5, 6],
# [2, 5, 0, 8],
# [0, 6, 8, 0]], dtype=object)
Each combination of company names is assigned a unique group number.
Step 3
You can now look up the group number shared by two companies by slicing the groups array:

# 3. Look up the group number shared by two companies
reversed_lookup = {name: i for i, name in enumerate(companies)}

def group_number(arr, a, b):
    """Return the name_group shared by companies `a` and `b` in 2D array `arr`."""
    i, j = reversed_lookup[a], reversed_lookup[b]
    return arr[i, j]

for pair in [["B", "C"], ["A", "B"], ["C", "C"]]:
    msg = "Companies {}: group {}"
    print(msg.format(" & ".join(pair), group_number(groups, *pair)))
# Companies B & C: group 5
# Companies A & B: group 1
# Companies C & C: group 0
Details
Step 1
Why use arrays? A numpy array allows fast lookups, just like pandas. In addition, we can later optimize performance using operations implemented at the C level (these are fast).
Why merge company names in the array? The 2D array of merged strings is used for comparing company names; this style of comparison resembles a statistical correlation matrix.
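To illustrate the broadcasting that get_2darray relies on, here is a minimal standalone sketch (the two-name list is just for demonstration):
x = np.array(["A", "B"])
# Adding a column vector to a row vector broadcasts into
# every pairwise concatenation.
np.core.defchararray.add(x[:, np.newaxis], x[np.newaxis, :])
# array([['AA', 'AB'],
#        ['BA', 'BB']], dtype='<U2')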
Step 2
How are groups determined? The company names are passed to a special dictionary (dd) that assigns the next incremented integer only when a new key is encountered. This dictionary is used to track groups as the transform helper function is applied to each element.
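As a minimal standalone sketch of how such a counter-backed dictionary behaves (the fruit keys are placeholders, not part of the answer):
dd = ct.defaultdict(it.count().__next__)
dd["apple"]   # 0: an unseen key receives the next integer
dd["banana"]  # 1
dd["apple"]   # 0: a seen key returns its existing number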
Why use a helper function? The transform function converts each item in the array to a group number. Notice the tracking dictionary (lookup) is passed in along with a predicate. Here are some notes about these group parameters:
When no predicate is given (pred=None), a default predicate is applied, which naively flags strings containing repeated characters (particularly along the diagonal).
You may wish to use another predicate. For example, under the default predicate, any set of lowered strings is equivalent, such that A == Aa == AaAa (see that the corners of the array are assigned to group 0). Here is another sample predicate that distinguishes A from Aa (groups 0 and 3 respectively):
pred = lambda x: all(not(v%2) for k, v in ct.Counter(x).items())
group(transform, arr, pred)
# array([[0, 1, 2, 3],
# [1, 0, 5, 6],
# [2, 5, 0, 8],
# [3, 6, 8, 0]], dtype=object)
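This predicate returns True exactly when every character in the joined string occurs an even number of times, which holds for self-pairs such as "AA" and "AaAa". A quick check with the pred defined above:
pred("AA")    # True  -> forced to group 0
pred("AAa")   # False -> receives its own group number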
How is performance optimized? Some operations are vectorized to help speed up the code with C implementations. In the group function, numpy.frompyfunc wraps the helper function. It was determined that this particular "universal function" is faster than the vectorizing function numpy.vectorize. See also the Scipy Lecture Notes for more ways to optimize numpy code.
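If you want to compare the two wrappers yourself, here is a minimal sketch (the shout function and the array contents are arbitrary assumptions for illustration):
def shout(s):
    return s.upper()

words = np.array(["alpha", "beta", "gamma"] * 1000)

ufunc = np.frompyfunc(shout, 1, 1)  # 1 input argument, 1 output
vfunc = np.vectorize(shout)

# %timeit ufunc(words)
# %timeit vfunc(words)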
Step 3
How do you find the group number for two companies? Simply by slicing the array returned from the group function. group_number is a slicing function for querying that array. Since the indices are numeric after Step 2, we build a reversed dictionary from our starting, ordered sequence companies to find the numeric index corresponding to each company name. Note the reversed dictionary is built outside of the slicing function to avoid rebuilding the dictionary on every query.
Performance
How fast is it? For our simple sequence of < 10 rows, the speed is sub-millisecond:
%timeit group(transform, arr)
# 10000 loops, best of 3: 110 µs per loop
For demonstration, let's scale the data up to around 1000 rows (beyond that, even creating the dataset takes a long time and consumes a lot of memory).
test = tuple(map(str, range(1000)))
full_arr = get_2darray(test)
print(full_arr.shape)
full_arr
# (1000, 1000)
# array([['00', '01', '02', ..., '0997', '0998', '0999'],
# ['10', '11', '12', ..., '1997', '1998', '1999'],
# ['20', '21', '22', ..., '2997', '2998', '2999'],
# ...,
# ['9970', '9971', '9972', ..., '997997', '997998', '997999'],
# ['9980', '9981', '9982', ..., '998997', '998998', '998999'],
# ['9990', '9991', '9992', ..., '999997', '999998', '999999']],
# dtype='<U6')
%timeit group(transform, full_arr)
# 1 loop, best of 3: 5.3 s per loop
Save some computation time by evaluating only half of the matrix:
half_arr = np.triu(full_arr)
half_arr
# array([['00', '01', '02', ..., '0997', '0998', '0999'],
# ['', '11', '12', ..., '1997', '1998', '1999'],
# ['', '', '22', ..., '2997', '2998', '2999'],
# ...,
# ['', '', '', ..., '997997', '997998', '997999'],
# ['', '', '', ..., '', '998998', '998999'],
# ['', '', '', ..., '', '', '999999']],
# dtype='<U6')
%timeit group(transform, half_arr)
# 1 loop, best of 3: 3.61 s per loop
Note: profiling was not performed on a dataset of 32k rows.
Conclusions
In this approach, the aforementioned objective was achieved: the problem was attacked from a different angle, using numpy to optimize the comparison functions at the C level. While the performance tests in this post may still take time, numpy offers headroom for further optimization. Moreover, it is probable that this code, as is, will take less than the OP's 8 hours on the full dataset. Further profiling is required to assess the complexity of this approach. If the complexity is reasonable, the user may decide how to proceed, e.g. parallel processing across multiple workers. Such tasks are left to those who may be interested.
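For those who are interested, here is a minimal sketch of what parallelizing the original pairwise scoring could look like with the standard library; the worker count, chunksize, and the score_pair helper are assumptions for illustration, not part of the answer above:
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

from fuzzywuzzy import fuzz

def score_pair(pair):
    """Score one pair of names; runs in a worker process."""
    a, b = pair
    return a, b, fuzz.token_sort_ratio(a, b)

def score_all(names, workers=4):
    """Fan the pairwise comparisons out across several processes."""
    pairs = combinations(names, 2)
    with ProcessPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(score_pair, pairs, chunksize=10000))
You would still need the OP's grouping pass over the returned (a, b, score) triples.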
References
Upvotes: 6