Reputation: 7990
I have a data file which contains 100,000+ lines. Each line contains just two fields, a key and a value, separated by a comma, and all the keys are unique. I want to query a value by key from this file. Loading it into a map is out of the question as that consumes too much memory (the code will run on an embedded device), and I don't want a DB involved. What I do so far is preprocess the file on my PC, i.e. sort the lines, and then binary search in the preprocessed file like below:
public long findKeyOffset(RandomAccessFile raf, String key)
        throws IOException {
    int blockSize = 8192;
    long fileSize = raf.length();
    long min = 0;
    long max = fileSize / blockSize;
    long mid;
    String line;
    // Binary search over fixed-size blocks until the key's block is found.
    while (max - min > 1) {
        mid = min + (max - min) / 2;
        raf.seek(mid * blockSize);
        raf.readLine(); // skip a probably partial line
        line = raf.readLine();
        if (line == null) { // landed inside the trailing partial line
            max = mid;
            continue;
        }
        String[] parts = line.split(",");
        if (key.compareTo(parts[0]) > 0) {
            min = mid;
        } else {
            max = mid;
        }
    }
    // Scan the block line by line to find the right line.
    min = min * blockSize;
    raf.seek(min);
    if (min > 0)
        raf.readLine(); // skip a probably partial line
    while (true) {
        min = raf.getFilePointer();
        line = raf.readLine();
        if (line == null)
            break;
        String[] parts = line.split(",");
        if (key.compareTo(parts[0]) <= 0)
            break;
    }
    raf.seek(min);
    return min;
}
I think there are better solutions than this. Can anyone give me some enlightenment?
Upvotes: 4
Views: 237
Reputation: 57
What about this?
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
using namespace std;

int main(int argc, char *argv[])
{
    if (argc < 3)
        return 0;
    ifstream f(argv[1], ios::ate); // open at the end so tellg() yields the file size
    if (!f.is_open())
        return 0;
    string key(argv[2]), value;
    int max = f.tellg();
    int min = 0, mid = 0;
    string s;
    // Binary search on byte offsets; assumes lines contain no whitespace,
    // so operator>> reads one whole "key,value" line at a time.
    while (max - min > 1)
    {
        mid = min + (max - min) / 2;
        f.seekg(mid);
        f >> s; // probably a partial line
        if (!f)
        {
            break;
        }
        if (mid) // at offset 0 the first token read is already a complete line
        {
            f >> s; // read the next complete line
        }
        std::vector<std::string> strs;
        boost::split(strs, s, boost::is_any_of(","));
        int comp = key.compare(strs[0]);
        if (comp < 0)
        {
            max = mid;
        }
        else if (comp > 0)
        {
            min = mid;
        }
        else
        {
            value = strs[1];
            break;
        }
    }
    cout << "key " << key;
    if (!value.empty())
    {
        cout << " found! value = " << value << endl;
    }
    else
    {
        cout << " not found..." << endl;
    }
    f.close();
    return 0;
}
Upvotes: 1
Reputation: 3275
An easy algorithm to optimise performance for your specific constraints:
1. Keep the file sorted and split it into k segments F1 ... Fk of roughly equal size.
2. Build a small index file P that holds the first key of each segment Fi.
3. To query a key, binary search P to find the segment Fi that can contain the key, then binary search inside Fi.
The total search will be O(log k) + O(log(n/k)), which is an improvement on the O(log n) of your original solution.
I would suggest finding a k that is big enough to allow you to load a specific segment file Fi into a HashMap, yet not so big that it fills up the memory on your device. The most balanced k is sqrt(n), which makes the solution run in O(log(sqrt(n))), but that may produce quite a large P file. If you find a k that lets you load both P and Fi into a HashMap for O(1) retrieval, that would be the best solution.
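A minimal Java sketch of the two-level search, assuming the sorted data was pre-split into segment files whose first keys are kept in a sorted in-memory array; the class name, file names and the linear segment scan are illustrative assumptions, not part of the answer:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SegmentIndex {
    private final String[] firstKeys;    // first key of each segment, sorted (the P file)
    private final String[] segmentFiles; // e.g. "seg0.txt" ... (hypothetical names)

    public SegmentIndex(String[] firstKeys, String[] segmentFiles) {
        this.firstKeys = firstKeys;
        this.segmentFiles = segmentFiles;
    }

    public String lookup(String key) throws IOException {
        // O(log k): find the last segment whose first key is <= the queried key.
        int lo = 0, hi = firstKeys.length - 1, seg = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (firstKeys[mid].compareTo(key) <= 0) {
                seg = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // Linear scan for brevity, i.e. O(n/k); a binary search on the segment
        // (as in the question) brings this part down to O(log(n/k)).
        try (BufferedReader r = new BufferedReader(new FileReader(segmentFiles[seg]))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split(",", 2);
                if (parts[0].equals(key))
                    return parts[1];
            }
        }
        return null; // key not present
    }
}

Loading firstKeys (the P file) once at startup is what keeps the first step at O(log k) with no disk access.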
Upvotes: 2
Reputation: 14379
Data is immutable and keys are unique (as mentioned in the comments on the question).
A simple solution: write your own hashing code that maps a key to a line number.
This means you skip the sorting and instead write your data to the file in the order your hashing algorithm dictates.
When a key is queried, you hash it, get the specific line number, and then read the value from that line.
In theory, you have an O(1) solution to your problem.
Ensure that the hashing algorithm has few collisions, although, depending on your exact case, a few collisions may be fine. Example: 3 keys map to the same line number, so you write all three of them on that line; when any of the collided keys is searched, you read all 3 entries from that line and do a linear (i.e. O(3), constant-time in this case) search over it.
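A minimal sketch of this idea, assuming fixed-width lines so that a bucket number converts directly to a byte offset; the record width, bucket count, separator and class name below are illustrative assumptions, not part of the answer:

import java.io.IOException;
import java.io.RandomAccessFile;

public class HashedFileLookup {
    // Illustrative layout: every bucket is one fixed-width, space-padded line
    // holding the few colliding "key,value" entries separated by ';'.
    static final int RECORD_WIDTH = 128;    // bytes per line, '\n' included
    static final int BUCKET_COUNT = 150_000;

    static int bucketOf(String key) {
        // Any stable hash works; String.hashCode() is used here for brevity.
        return Math.floorMod(key.hashCode(), BUCKET_COUNT);
    }

    static String lookup(RandomAccessFile raf, String key) throws IOException {
        raf.seek((long) bucketOf(key) * RECORD_WIDTH); // O(1) jump to the bucket
        String line = raf.readLine();
        if (line == null)
            return null;
        // Linear scan over the handful of entries sharing this bucket.
        for (String entry : line.trim().split(";")) {
            String[] parts = entry.split(",", 2);
            if (parts.length == 2 && parts[0].equals(key))
                return parts[1];
        }
        return null; // empty bucket or no matching entry
    }
}

The preprocessing step on the PC would of course have to write the file using the same hash and record layout.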
Upvotes: 3