Reputation: 1930
I'm working with huge data CSV file. Each file contains milions of record , each record has a key. The records are sorted by thier key. I dont want to go over the whole file when searching for certian data. I've seen this solution : Reading Huge File in Python
But it suggests that you use the same length of lines on the file - which is not supported in my case.
I thought about adding a padding to each line and then keeping a fixed line length , but I'd like to know if there is a better way to do it.
I'm working with python
Upvotes: 4
Views: 6188
Reputation: 184
To resolve it, you also can use binary search, but you need to change it a bit:
Here is a sample code:
fp = open('your file')
fp.seek(0, 2)
begin = 0
end = fp.tell()
while (begin < end):
fp.seek((end + begin) / 2, 0)
fp.readline()
line_key = get_key(fp.readline())
if (key == line_key):
pass # find what you want
elif (key > line_key):
begin = fp.tell()
else:
end = fp.tell()
Maybe the code has bugs. Verify yourself. And please check the performance if you really want a fastest way.
Upvotes: 2
Reputation: 3956
The answer on the referenced question that says binary search only works with fixed-length records is wrong. And you don't need to do a search at all, since you have multiple items to look up. Just walk through the entire file one line at a time, build a dictionary of key:offset
for each line, and then for each of your search items jump to the record of interest using os.lseek
on the offset corresponding to each key.
Of course, if you don't want to read the entire file even once, you'll have to do a binary search. But if building the index can be amortized over several lookups, perhaps saving the index if you only do one lookup per day, then a search is unnecessary.
Upvotes: 1
Reputation: 86768
You don't have to have a fixed width record because you don't have to do a record-oriented search. Instead you can just do a byte-oriented search and make sure that you realign to keys whenever you do a seek. Here's a (probably buggy) example of how to modify the solution you linked to from record-oriented to byte-oriented:
bytes = 24935502 # number of entries
for i, search in enumerate(list): # list contains the list of search keys
left, right = 0, bytes - 1
key = None
while key != search and left <= right:
mid = (left + right) / 2
fin.seek(mid)
# now realign to a record
if mid:
fin.readline()
key, value = map(int, fin.readline().split())
if search > key:
left = mid + 1
else:
right = mid - 1
if key != search:
value = None # for when search key is not found
search.result = value # store the result of the search
Upvotes: 8