Roberto
Roberto

Reputation: 2194

Python lookup search in a list: multiple search criteria, return value required

I have a relatively large csv file containing a list of companies, products, and prices. The ordering of the data is not guaranteed (i.e. not sorted):

csv#1 (big file)        
... 
CompanyA     productB     0
CompanyA     productA     0
CompanyA     productC     0
CompanyB     productA     0
CompanyB     productB     0
CompanyB     productC     0
...     

Some of the entries in "csv#1" have bad data (zeroes). I have a second csv containing only the names from csv#1 that had bad data (and their corrected data). The ordering of this csv is by descending price:

csv#2 (small file - subset of csv#1)        
CompanyA     productC     15
CompanyA     productB     10
CompanyA     productA     5
CompanyB     productA     3
CompanyB     productB     2
CompanyB     productC     1

I want to iterate through csv#1 and if the combination of Company + product is in csv#2, overwrite with csv#2 price.

I know I can do this by brute force, iterating over csv#2 for every row in csv#1. I could even optimize by loading csv#2 into an array and removing entries once they are found (each combination will show up exactly once in csv#1). But I am certain there must be a better way.

I found some references indicating that sets are a more efficient way to do these kinds of lookup searches:

Most efficient way for a lookup/search in a huge list (python)

Fastest way to search a list in python

But I am not sure how to apply sets to my example. How to I structure a set here, given the multiple search columns, and the need to return a value if there is a match? Or is there a better approach than sets?

Upvotes: 0

Views: 1873

Answers (2)

saeedgnu
saeedgnu

Reputation: 4366

I would suggest loading csv#2 into a dictionary which is actually a hash table and queries are fast

Set is also a hash table without values, but you have values here

The keys of dict are tuples of (companyName, productName) , and values are prices

Then iterate over csv#1 and check if the correction dict has the key for that company name (use has_key, or simple get the key in try ... except block) and if there was, do the fix using associated price value

Upvotes: 1

C.B.
C.B.

Reputation: 8336

Since you could technically associate a key with a value, why not use a dictionary? It has constant lookup time O(1) instead of O(N) for a list. It is similar to a set except for the concept of key value pair.

csv1_dict = {  ...,
            "CompanyA productA" : 0,
            "CompanyA productB" : 0,
            ...
            }

csv2_dict = { ...,
            "CompanyA productA" : 10,
              ...
            }
for key,value in csv2_dict.iteritems():
    try:
        csv1_dict[key] = csv2_dict[key]
    except:
        #Key not in csv1

If you can guarantee that Company products in csv2 are in csv1, feel free to remove the try block.

Upvotes: 2

Related Questions