jimstandard

Reputation: 1147

Could someone give me their two cents on this optimization strategy?

Background: I am writing a matching script in Python that will match records of a transaction in one database to names of customers in another database. The complexity is that names are not unique and can be represented in multiple different ways from transaction to transaction.

Rather than doing multiple queries on the database (which is pretty slow), would it be faster to get all of the records where the last name (which in this case we will say never changes) is "Smith", load all of those records into memory, and then go through each one looking for matches for a specific "John Smith" using various data points?

Would this be faster, is it feasible in Python, and if so, does anyone have any recommendations for how to do it?
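For concreteness, a minimal sketch of the load-once-then-match-in-memory idea I have in mind (the table and column names are made up, and the "?" placeholder style is sqlite3's):

    def load_last_name_group(conn, last_name):
        """One query pulls every customer with the given last name;
        the rows then stay in memory for repeated matching."""
        cur = conn.cursor()
        # table/column names are made up; "?" is the sqlite3 placeholder style
        cur.execute(
            "SELECT id, first_name, last_name, email FROM customers WHERE last_name = ?",
            (last_name,),
        )
        return cur.fetchall()

    def match_transaction(txn, candidates):
        """Walk the in-memory rows looking for a match on whatever extra
        data points the transaction (a dict here) carries, e.g. email."""
        for cust_id, first_name, last_name, email in candidates:
            if txn.get("email") and txn["email"] == email:
                return cust_id
        return None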

Upvotes: 0

Views: 106

Answers (3)

Dave

Reputation: 3956

The problem is not one of efficiency, but of correctness. Regardless of whether you do multiple small queries on the database or a single large one, if the names are neither unique nor consistent, what are you going to do with them?

Transaction 1: name="John Smith"
Transaction 2: name="John T. Smith"
Transaction 3: name="John Smith, Jr."
Transaction 4: name="Johnny Smith"

There could be anywhere between 1 and 4 different people behind these transactions, and without other identifying information (such as credit card number, email address, shipping address), what is your program going to do once it has found all the "Smiths"?

To answer the question, though: "it depends". One might assume that a single large query would be faster, but if it returns 99% chaff (Bob Smiths, Terry Smiths, etc.), querying each name individually could well be much faster. If you do have supplemental information such as a credit card number that is both "more unique" and indexed, it would probably be a better strategy to query against that rather than against the name.
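As a rough sketch of that last point, preferring an indexed, near-unique column over the name whenever one is available (table and column names here are made up, and the "?" placeholder style is sqlite3's):

    def find_by_card_then_name(conn, txn):
        """Prefer an indexed, near-unique key (card number here) and fall back
        to the last name only when nothing better is available.
        Table/column names are made up; "?" is the sqlite3 placeholder style."""
        cur = conn.cursor()
        if txn.get("card_number"):
            cur.execute(
                "SELECT id, first_name, last_name FROM customers WHERE card_number = ?",
                (txn["card_number"],),
            )
        else:
            cur.execute(
                "SELECT id, first_name, last_name FROM customers WHERE last_name = ?",
                (txn["last_name"],),
            )
        return cur.fetchall()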

Upvotes: 1

BRPocock

Reputation: 13934

Regarding "Would this be faster?":

The behind-the-scenes logistics of the SQL engine are really optimized for this sort of thing. You might need to create an SQL PROCEDURE or a fairly complex query, however.

Caveat: if you're not particularly good at (or fond of) maintaining SQL, and this isn't a time-sensitive query, then you might be spending more programmer time getting it right than you would ever save in CPU/IO time.

However, if this is something that runs often or is time-sensitive, you should almost certainly be building some kind of JOIN logic in SQL, passing in the appropriate values (possibly wildcards), and letting the database do the filtering in the relational data set, instead of collecting a larger number of "wrong" records and then filtering them out in procedural code.
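A hedged sketch of what "letting the database do the filtering" might look like, with the join and the wildcard test expressed in SQL (the table and column names are assumptions, and substr()/|| are SQLite syntax):

    def match_in_database(conn):
        """Join transactions to customers inside the database so the engine does
        the filtering; only plausible matches come back to the Python side.
        Table/column names are assumptions; substr() and || are SQLite syntax."""
        cur = conn.cursor()
        cur.execute(
            """
            SELECT t.id, c.id, c.first_name, c.last_name
            FROM transactions AS t
            JOIN customers AS c
              ON c.last_name = t.last_name
             AND c.first_name LIKE substr(t.first_name, 1, 1) || '%'
            """
        )
        return cur.fetchall()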

You say the database is "pretty slow." Is this because it's on a distant host, or because the tables aren't indexed for the types of searches you're doing? … If you're doing a complex query against columns that aren't indexed for it, that can be a pain; you can use various SQL tools, including ANALYZE, to see what might be slowing down a query. Most SQL GUIs have shortcuts for such things as well.
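For instance, with SQLite the index-and-inspect workflow could look like the following; other engines have their own ANALYZE/EXPLAIN variants (the file and table names are illustrative):

    import sqlite3

    conn = sqlite3.connect("customers.db")   # file name is illustrative
    cur = conn.cursor()

    # Index the column the matching queries filter on, so lookups by last name
    # don't have to scan the whole table.
    cur.execute(
        "CREATE INDEX IF NOT EXISTS idx_customers_last_name ON customers (last_name)"
    )

    # Ask the engine how it intends to run the query; "SCAN customers" rather than
    # "SEARCH customers USING INDEX ..." suggests the index isn't being used.
    for row in cur.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM customers WHERE last_name = ?",
        ("Smith",),
    ):
        print(row)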

Upvotes: 2

Raymond Hettinger

Reputation: 226754

Your strategy is reasonable, though I would first look at doing as much of the work as possible in the database query using LIKE and other SQL functions. It should be possible to build a query that matches fairly complex criteria.
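For example, a rough sketch of pushing the name-variant matching into a single LIKE-based query (the column names are assumptions):

    def query_name_variants(conn, first_name, last_name):
        """Push the fuzzy part into SQL: normalize case with UPPER() and anchor on
        the first initial with LIKE, so 'John', 'John T.' and 'Johnny' all match.
        Column names are made up; UPPER() and LIKE are standard SQL."""
        cur = conn.cursor()
        cur.execute(
            "SELECT id, first_name, last_name FROM customers "
            "WHERE UPPER(last_name) = UPPER(?) AND UPPER(first_name) LIKE UPPER(?)",
            (last_name, first_name[:1] + "%"),
        )
        return cur.fetchall()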

Upvotes: 0
