MikeS159
MikeS159

Reputation: 1964

Binary search on file with different line length

I have some code which does a binary search over a file with sorted hex values (SHA1 hashes) on each line. This is used to search the HaveIBeenPwned database. The latest version contains a count of the number of times each password hash was found, so some lines have extra characters at the end, in the format ':###'

The length of this additional check isn't fixed, and it isn't always there. This causes the buffer to read incorrect values and fail to find values that actually exist.

Current code:

static bool Check(string asHex, string filename)
{
    const int LINELENGTH = 40;  //SHA1 hash length

    var buffer = new byte[LINELENGTH];
    using (var sr = File.OpenRead(filename))
    {
        //Number of lines
        var high = (sr.Length / (LINELENGTH + 2)) - 1;
        var low = 0L;

        while (low <= high)
        {
            var middle = (low + high + 1) / 2;
            sr.Seek((LINELENGTH + 2) * ((long)middle), SeekOrigin.Begin);
            sr.Read(buffer, 0, LINELENGTH);
            var readLine = Encoding.ASCII.GetString(buffer);
            switch (readLine.CompareTo(asHex))
            {
                case 0:
                    return true;

                case 1:
                    high = middle - 1;
                    break;

                case -1:
                    low = middle + 1;
                    break;

                default:
                    break;
            }
        }
    }
    return false;
}

My idea is to seek forward from the middle until a newline character is found, then seek backwards for the same point, which should give me a complete line which I can split by the ':' delimiter. I then compare the first part of the split string array which should be just a SHA1 hash.

I think this should still centre on the correct value, however I am wondering if there is a neater way to do this? If the midpoint isn't that actual midpoint between the end of line characters, should it be adjusted before the high and low values are?

Upvotes: 2

Views: 255

Answers (2)

enocknitti
enocknitti

Reputation: 31

My way of solving your problem was to create a new binary file containing the hashes only. 16 byte/hash and a faster binary search  ( I don't have 50 reps needed to comment only )

Upvotes: 0

Marker
Marker

Reputation: 977

I THINK this may be a possible simpler (faster) solution without the backtracking to the beginning of the line. I think you can just use byte file indexes instead of trying to work with a full "record/line. Because the middle index will not always be at the start of a line/record, the "readline" can return a partial line/record. If you were to immediately do a second "readline", you would get a full line/record. It wouldn't be quite optimal, because you would actually be comparing a little ahead of the middle index.

I downloaded the pwned-passwords-update-1 and pulled out about 30 records at the start, end, and in the middle, it seemed to find them all. What do you think?

const int HASHLENGTH = 40;

static bool Check(string asHex, string filename)
{
    using (var fs = File.OpenRead(filename))
    {
        var low = 0L;
        // We don't need to start at the very end
        var high = fs.Length - (HASHLENGTH - 1); // EOF - 1 HASHLENGTH

        StreamReader sr = new StreamReader(fs);

        while (low <= high)
        {
            var middle = (low + high + 1) / 2;
            fs.Seek(middle, SeekOrigin.Begin);

            // Resync with base stream after seek
            sr.DiscardBufferedData();

            var readLine = sr.ReadLine();

            // 1) If we are NOT at the beginning of the file, we may have only read a partial line so
            //    Read again to make sure we get a full line.
            // 2) No sense reading again if we are at the EOF
            if ((middle > 0) && (!sr.EndOfStream)) readLine = sr.ReadLine() ?? "";

            string[] parts = readLine.Split(':');
            string hash = parts[0];

            // By default string compare does a culture-sensitive comparison we may not be what we want?
            // Do an ordinal compare (0-9 < A-Z < a-z)
            int compare = String.Compare(asHex, hash, StringComparison.Ordinal);

            if (compare < 0)
            {
                high = middle - 1;
            }
            else if (compare > 0)
            {
                low = middle + 1;
            }
            else
            {
                return true;
            }
        }
    }
    return false;
}

Upvotes: 2

Related Questions