Reputation: 2194
I have a relatively large csv file containing a list of companies, products, and prices. The ordering of the data is not guaranteed (i.e. not sorted):
csv#1 (big file)
...
CompanyA productB 0
CompanyA productA 0
CompanyA productC 0
CompanyB productA 0
CompanyB productB 0
CompanyB productC 0
...
Some of the entries in "csv#1" have bad data (zeroes). I have a second csv containing only the names from csv#1 that had bad data (and their corrected data). The ordering of this csv is by descending price:
csv#2 (small file - subset of csv#1)
CompanyA productC 15
CompanyA productB 10
CompanyA productA 5
CompanyB productA 3
CompanyB productB 2
CompanyB productC 1
I want to iterate through csv#1 and if the combination of Company + product is in csv#2, overwrite with csv#2 price.
I know I can do this by brute force, iterating over csv#2 for every row in csv#1. I could even optimize by loading csv#2 into an array and removing entries once they are found (each combination will show up exactly once in csv#1). But I am certain there must be a better way.
I found some references indicating that sets
are a more efficient way to do these kinds of lookup searches:
Most efficient way for a lookup/search in a huge list (python)
Fastest way to search a list in python
But I am not sure how to apply sets
to my example. How to I structure a set
here, given the multiple search columns, and the need to return a value if there is a match? Or is there a better approach than sets
?
Upvotes: 0
Views: 1873
Reputation: 4366
I would suggest loading csv#2 into a dictionary which is actually a hash table and queries are fast
Set is also a hash table without values, but you have values here
The keys of dict are tuples of (companyName, productName) , and values are prices
Then iterate over csv#1 and check if the correction dict has the key for that company name (use has_key, or simple get the key in try ... except block) and if there was, do the fix using associated price value
Upvotes: 1
Reputation: 8336
Since you could technically associate a key with a value, why not use a dictionary? It has constant lookup time O(1) instead of O(N) for a list. It is similar to a set except for the concept of key value pair.
csv1_dict = { ...,
"CompanyA productA" : 0,
"CompanyA productB" : 0,
...
}
csv2_dict = { ...,
"CompanyA productA" : 10,
...
}
for key,value in csv2_dict.iteritems():
try:
csv1_dict[key] = csv2_dict[key]
except:
#Key not in csv1
If you can guarantee that Company products in csv2 are in csv1, feel free to remove the try block.
Upvotes: 2