cyorkston

Reputation: 235

Optimizing Levenshtein distance algorithm in Salesforce

I have a custom object called customer with fields such as Customer_Name, Address_Line_1, Post_Code etc.

I'd like to run through all records and compare Customer_Name for likeness (based on fuzzy search or levenshtein distance). If the likeness is above or below a certain threshold, a custom field (Possible_Duplicate_Customer_ID__c) will be updated to identify the possible duplicate.

I've managed to implement this, but I'm experiencing two issues:

1). Exceeding Salesforce governor limits ("Too many script statements: 200001"), likely caused by the heavy looping required by the Levenshtein distance algorithm.
2). The list (newList) I am committing contains duplicate IDs.

private static List<Customer__c> newList = new List<Customer__c>();

webService static Integer findDupes() {

    Integer returnCount = 0;
    Double cost = 0;
    Integer COST_THRESHOLD = 5;

    Map<Id, Customer__c> cMap = new Map<Id, Customer__c>([
        SELECT Id, Name, Customer_Name__c, Possible_Duplicate_Customer_ID__c
        FROM Customer__c
    ]);

    List<Customer__c> custList1 = cMap.values();
    List<Customer__c> custList2 = custList1.clone();

    for (Customer__c cust1 : custList1) {
        for (Customer__c cust2 : custList2) {
            cost = LevenshteinDistance.computeLevenshteinDistance(
                    cust1.Customer_Name__c, cust2.Customer_Name__c);
            if (cost < COST_THRESHOLD && cost != 0) {
                Customer__c c = new Customer__c(
                    Id = cust2.Id,
                    Possible_Duplicate_Customer_ID__c = cust1.Name
                );
                newList.add(c);
            }
            System.debug(cost + ' edits to transform '
                    + cust1.Customer_Name__c + ' to ' + cust2.Customer_Name__c);
        }
    }

    returnCount = newList.size();

    update newList;
    return returnCount;
}

Upvotes: 3

Views: 2498

Answers (3)

Marc

Reputation: 14295

Have you tried the new getLevenshteinDistance method of String?

See also my question/approach here. I keep the number of initial matches down by insisting that only matches in the same country, with the same postal code or city, are returned.
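For illustration, a minimal sketch of the inner comparison using the built-in method (the field and threshold names are taken from the question's code):

// Built-in String method; no hand-rolled distance class required.
Integer cost = cust1.Customer_Name__c.getLevenshteinDistance(cust2.Customer_Name__c);
if (cost < COST_THRESHOLD && cost != 0) {
    // Flag cust2 as a possible duplicate of cust1.
}

There is also an overload that takes a maximum threshold, which lets the method stop early on clearly dissimilar strings.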

Upvotes: 2

jkraybill

Reputation: 3409

Lev distance is a great tool for fuzzy matching, but basically unusable in Apex due to the script statement limits. Using a version I've got lying around (adapted from an old version of Apex Lang), comparing "0123456789" to "0246803579" takes 700+ script statements. Comparing "actual resource usage has basically no correlation with number of lines of code executed" to "yeah but annoying a 'few' advanced developers will allow us to cut corners during governor limit implementation" takes 60,000 script statements. Unless you're doing small numbers of small comparisons, or have somehow rewritten Lev to be more script-statement friendly, it's going to be difficult to justify on the platform.

I've taken to using cheaper proxies for Lev in Apex, like Soundex for name or short word comparison, or fancy dynamic SOQL "LIKE" statements. If what you're trying to do can somehow be distilled into set operations, those give you a good bang for the buck in Apex since .contains only costs you one script execution.
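As a rough sketch of the dynamic SOQL "LIKE" prefilter idea (the 4-character prefix length and variable names here are illustrative assumptions, not part of the original code):

// Prefilter candidates with a LIKE on a name prefix so that any expensive
// distance comparison only runs against a small candidate set.
// The 4-character prefix length is an arbitrary choice for illustration.
String prefix = String.escapeSingleQuotes(cust1.Customer_Name__c.left(4)) + '%';
Id currentId = cust1.Id;
List<Customer__c> candidates = Database.query(
    'SELECT Id, Name, Customer_Name__c FROM Customer__c ' +
    'WHERE Customer_Name__c LIKE :prefix AND Id != :currentId');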

If you really need to do lots of Lev, you may have to resort to using the API or rewriting the code to be a lot more line-compact. Pushing computation into the browser may also be an option depending on your use case.

Upvotes: 0

Matt Lacey

Reputation: 8255

I would suggest running the code inside a class that implements the Database.Batchable interface; this is much better suited to processing large volumes of data. Since your web service doesn't take an input to work with, you could run the batch hourly on a schedule, flag dupes by marking the records, and then extract those in the web service. Of course, if you need it to be in real time you'll need to optimise this loop instead.
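A minimal sketch of that batch shape (the class name and method bodies are assumptions, with the matching logic left as a placeholder):

// Sketch: a batch job that processes Customer__c records in chunks,
// so each execute() call gets its own set of governor limits.
global class CustomerDupeBatch implements Database.Batchable<SObject> {

    global Database.QueryLocator start(Database.BatchableContext bc) {
        return Database.getQueryLocator(
            'SELECT Id, Name, Customer_Name__c, ' +
            'Possible_Duplicate_Customer_ID__c FROM Customer__c');
    }

    global void execute(Database.BatchableContext bc, List<SObject> scope) {
        // Compare each record in this chunk against candidate duplicates,
        // set Possible_Duplicate_Customer_ID__c on matches, then update them.
    }

    global void finish(Database.BatchableContext bc) {
        // Optionally send a notification or chain another job here.
    }
}

It can be kicked off with Database.executeBatch(new CustomerDupeBatch()); and run hourly via a Schedulable class.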

As for the duplicate IDs in the update list, your use of cust2.Id for the updates should account for this, but you don't seem to be protecting against the case where a customer record is compared with itself! This should fix it up:

for (Customer__c cust1 : custList1) {
    for (Customer__c cust2 : custList2) {
        // Skip comparing a record with itself.
        if (cust1.Id == cust2.Id) {
            continue;
        }
        // ... rest of the comparison logic as before ...
    }
}

Upvotes: 1
