jfly

Reputation: 7990

Search for a line in a preprocessed big text file

I have a data file with 100,000+ lines. Each line contains just two fields, a key and a value, separated by a comma, and all the keys are unique. I want to query a value by key from this file. Loading it into a map is out of the question because that consumes too much memory (the code will run on an embedded device), and I don't want a DB involved. What I do so far is preprocess the file on my PC, i.e., sort the lines, then use binary search on the preprocessed file, like below:

public long findKeyOffset(RandomAccessFile raf, String key)
        throws IOException {
    int blockSize = 8192;
    long fileSize = raf.length();
    long min = 0;
    long max = fileSize / blockSize;
    long mid;
    String line;
    // Binary search over block-aligned offsets until the key's block is found.
    while (max - min > 1) {
        mid = min + (max - min) / 2;
        raf.seek(mid * blockSize);
        if (mid > 0)
            raf.readLine(); // skip what is probably a partial line
        line = raf.readLine();
        String[] parts = line.split(",");
        if (key.compareTo(parts[0]) > 0) {
            min = mid;
        } else {
            max = mid;
        }
    }
    // Scan the candidate block line by line for the exact key.
    min = min * blockSize;
    raf.seek(min);
    if (min > 0)
        raf.readLine(); // skip the partial line at the block boundary
    while (true) {
        min = raf.getFilePointer();
        line = raf.readLine();
        if (line == null)
            break;
        String[] parts = line.split(",");
        if (key.compareTo(parts[0]) <= 0) // first line whose key >= the target
            break;
    }
    raf.seek(min);
    return min;
}

I think there are better solutions than this. Can anyone give me some enlightenment?

Upvotes: 4

Views: 237

Answers (3)

Andrej Shirinkin

Reputation: 57

What about this?

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>

using namespace std;

int main(int argc, char *argv[])
{
    if (argc < 3)
        return 1;
    ifstream f(argv[1], ios::ate); // open at the end so tellg() gives the size
    if (!f.is_open())
        return 1;
    string key(argv[2]), value;

    long max = f.tellg();
    long min = 0, mid = 0;
    string s;
    // Binary search over raw byte offsets.
    while (max - min > 1)
    {
        mid = min + (max - min) / 2;
        f.clear();
        f.seekg(mid);
        f >> s; // rest of a (probably partial) line
        if (mid)
            f >> s; // the first complete line after the offset
        if (!f)
        {
            // no complete line after this offset: the key must be earlier
            max = mid;
            continue;
        }
        vector<string> strs;
        boost::split(strs, s, boost::is_any_of(","));
        int comp = key.compare(strs[0]);
        if (comp < 0)
        {
            max = mid;
        }
        else if (comp > 0)
        {
            min = mid;
        }
        else
        {
            value = strs[1];
            break;
        }
    }
    // The offset search can exit without ever examining the line that starts
    // at (or straddles) min -- in particular the very first line of the file --
    // so finish with a short linear scan from min.
    if (value.empty())
    {
        f.clear();
        f.seekg(min);
        if (min)
            f >> s; // skip the partial line at min
        while (f >> s)
        {
            vector<string> strs;
            boost::split(strs, s, boost::is_any_of(","));
            int comp = key.compare(strs[0]);
            if (comp == 0)
            {
                value = strs[1];
                break;
            }
            if (comp < 0) // passed where the key would be: not present
                break;
        }
    }
    cout << "key " << key;
    if (!value.empty())
    {
        cout << " found! value = " << value << endl;
    }
    else
    {
        cout << " not found..." << endl;
    }

    f.close();
    return 0;
}

Upvotes: 1

Assafs

Reputation: 3275

An easy algorithm to optimise performance for your specific constraints:

  1. Let n be the number of lines in the original, immutable, sorted file.
  2. Let k < n be a number (the ideal value is discussed below).
  3. Divide the file into k files with approximately equal numbers of lines, so each file has n/k lines. The files will be referred to as F1...Fk. If you prefer to keep the original file intact, just treat F1...Fk as ranges of line numbers within the file, cutting it into segments.
  4. Create a new file, called P, with k lines, where line i holds the first key of Fi.
  5. When looking for a key, first binary-search P in O(log k) to find which file/segment (F1...Fk) you need, then search within that file/segment.
  6. If k is big enough, the size of Fi (n/k lines) will be small enough to load into a HashMap and retrieve the value in O(1). If that is still not practical, do a binary search within Fi in O(log(n/k)).

The total search is then O(log k) + O(log(n/k)). Note that log k + log(n/k) = log n, so the two binary searches alone match your original O(log n) bound; the real improvement comes when n/k is small enough that Fi can be loaded into a HashMap, making a lookup O(log k) + O(1).

I would suggest finding a k that is big enough to let you load a specific Fi file/segment into a HashMap, yet not so big that P fills up space on your device. The most balanced choice is k = sqrt(n), which makes each binary search run in O(log(sqrt(n))) (half of log n), but then P may be quite a large file. If you can find a k that allows you to load both P and Fi into HashMaps for O(1) retrieval, that would be the best solution.
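For illustration, here is a minimal Java sketch of that two-level lookup. The file names index.txt and seg<i>.txt, and the one-first-key-per-line index layout, are assumptions for the example, not part of the answer:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SegmentedLookup {

    // The index P: the first key of each segment, loaded once and kept in memory.
    private final List<String> firstKeys = new ArrayList<>();

    public SegmentedLookup(String indexFile) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(indexFile))) {
            String line;
            while ((line = r.readLine()) != null)
                firstKeys.add(line);
        }
    }

    public String find(String key) throws IOException {
        // O(log k): find the last segment whose first key is <= the search key.
        int lo = 0, hi = firstKeys.size() - 1, seg = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (firstKeys.get(mid).compareTo(key) <= 0) {
                seg = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // Load only that segment (n/k lines) into a HashMap for O(1) lookup.
        Map<String, String> segment = new HashMap<>();
        try (BufferedReader r = new BufferedReader(new FileReader("seg" + seg + ".txt"))) {
            String line;
            while ((line = r.readLine()) != null) {
                int comma = line.indexOf(',');
                segment.put(line.substring(0, comma), line.substring(comma + 1));
            }
        }
        return segment.get(key); // null if the key is not in the file
    }
}

Only P stays in memory permanently; a single segment of n/k entries is loaded per query, so peak memory is the index plus one segment.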

Upvotes: 2

displayName

Reputation: 14379

Data is immutable and keys are unique (as mentioned in the comments on the question).

A simple solution: write your own hashing code that maps each key to a line number.

That is, drop the sorting and instead write your data to the file in the order your hashing algorithm dictates.

When a key is queried, hash it to get the line number, then read the value from that line. (For the seek itself to be O(1), use fixed-width lines, so that line i starts at byte i * width.)

In theory, you have an O(1) solution to your problem.


Ensure that the hashing algorithm produces few collisions, though depending on your exact case a few collisions should be fine. Example: 3 keys map to the same line number, so you write all three of them on that line; when any of the collided keys is searched, you read all 3 entries from the line and do a linear (O(3), i.e. still constant-time) search across it.
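For illustration, here is a minimal Java sketch of such a scheme, assuming fixed-width bucket lines so that bucket i starts at byte i * RECORD_WIDTH; the width, the bucket count, and the semicolon-separated collision layout are all assumptions for the example, not part of the answer:

import java.io.IOException;
import java.io.RandomAccessFile;

public class HashedFileLookup {

    static final int RECORD_WIDTH = 64;  // illustrative fixed line width, newline included
    static final int BUCKETS = 131072;   // illustrative; comfortably above 100,000 keys

    // Any stable hash works; String.hashCode is specified and portable.
    static int bucket(String key) {
        return (key.hashCode() & 0x7fffffff) % BUCKETS;
    }

    static String find(RandomAccessFile raf, String key) throws IOException {
        raf.seek((long) bucket(key) * RECORD_WIDTH); // O(1) jump, no search
        String line = raf.readLine();
        if (line == null)
            return null;
        // A bucket line may hold a few collided entries: k1,v1;k2,v2;...
        for (String entry : line.trim().split(";")) {
            String[] kv = entry.split(",", 2);
            if (kv.length == 2 && kv[0].equals(key))
                return kv[1];
        }
        return null;
    }
}

The preprocessing step on the PC would write each record, space-padded to RECORD_WIDTH bytes, at the offset its bucket dictates; a bucket that overflows the fixed width would need a larger width or a better hash.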

Upvotes: 3
