martin

Reputation:

Best way to search large file for data in .net

I am working on a project where I search through a large text file (large is relative, file size is about 1 Gig) for a piece of data. I am looking for a token and I want a dollar value immediately after that token. For example,

this is the token 9,999,999.99

So here's how I am approaching this problem. After a little analysis it appears that the token is usually near the end of the file, so I thought I would start searching from the end of the file. Here is the code I have so far (vb.net):

    Dim sToken As String = "This is a token"
    Dim sr As New StreamReader(sFileName_IN)

    Dim FileSize As Long = GetFileSize(sFileName_IN)
    Dim BlockSize As Integer = CInt(FileSize / 1000)
    Dim buffer(BlockSize - 1) As Char   ' VB array bounds are inclusive
    Dim Position As Long
    Dim sBuffer As String
    Dim CurrentBlock As Integer = 0
    Dim Value As Double

    Dim i As Integer

    Dim found As Boolean = False
    While Not found And CurrentBlock < 1000
        CurrentBlock += 1
        Position = -CurrentBlock * BlockSize

        sr.BaseStream.Seek(Position, SeekOrigin.End)
        sr.DiscardBufferedData()        ' the reader caches data, so reset it after seeking
        i = sr.ReadBlock(buffer, 0, BlockSize)
        sBuffer = New String(buffer, 0, i)  ' only keep the characters actually read

        found = SearchBuffer(sBuffer, sToken, Value)
    End While

GetFileSize is a function that returns the file size. SearchBuffer is a function that will search a string for the token. I am not familiar with regular expressions but will explore them for that function.
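Since SearchBuffer only has to find the token and pull out the dollar value that follows it, a regular expression fits well. A rough sketch of the idea in Python (the number format, with thousands separators and two decimal places, is an assumption based on the example above):

```python
import re

def search_buffer(text, token):
    """Find `token` in `text` and return the dollar value after it, or None."""
    # Allow optional thousands separators and two decimal places, e.g. 9,999,999.99
    pattern = re.escape(token) + r"\s+([\d,]+\.\d{2})"
    match = re.search(pattern, text)
    if match:
        # Strip the commas so the value can be parsed as a number
        return float(match.group(1).replace(",", ""))
    return None
```

For the example in the question, `search_buffer("this is the token 9,999,999.99", "this is the token")` returns `9999999.99`.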

Basically, I read in a small chunk of the file, search it, and if I don't find the token, load another chunk, and so on...

Am I on the right track or is there a better way?

Upvotes: 1

Views: 3338

Answers (6)

Stuart Kearney

Reputation: 1

"What if the token is broken between two chunks? Have you considered this?"

Have done this just recently. I save the CurrentBlock into a PreviousBlock before overwriting CurrentBlock, then join the two blocks and search the combined string if there's no joy in finding the search term in the current block alone. Works well. The search term can't escape, unless the search term is longer than a block.

Upvotes: 0

Yes Man

Reputation: 411

You could always search through the file using a FileStream (or continue doing it your way, your choice). If you decide to use the FileStream approach then what you would want to do is something like this:

Dim stream As New FileStream("something.txt", FileMode.Open)
Dim findBytes As Byte() = Encoding.ASCII.GetBytes("whatever")
Dim f As Integer = 0

' remaining = Length - Position
While stream.Length - stream.Position > 0
    If stream.ReadByte() = findBytes(f) Then
        f += 1  ' VB has no ++ operator
        If f >= findBytes.Length Then
            Console.WriteLine(stream.Position)
            Exit While
        End If
    Else
        f = 0
    End If
End While

Just to note: I used a C#-to-VB converter because I don't like VB.

The basic idea applies to just searching the block for a string. It's pretty simple if you want to add reading in blocks.

Upvotes: 0

Andrei Rînea

Reputation: 20780

Wait you people...

What if the token is broken between two chunks? Have you considered this?

Upvotes: 1

Will Dean

Reputation: 39500

If you wanted to do something more complicated but possibly faster, then you could look at reading the blocks asynchronously, so that you're searching one while the next is loading.

That way you get to perform the search at the same time as the data is chugging into memory.

I have to say though that unless your search is very expensive, disk read time will probably completely dominate this, and so complicated overlapping won't be worth the additional complexity.

Upvotes: 0

Will Dean

Reputation: 39500

If you're going to use chunks, it would be wise to use blocks which are multiples of 512 bytes long, and seek on a 512 byte alignment, because that will tend to be more efficient in accessing the disk (which ultimately will be in 512 byte blocks).

There may be other granularities even better than that, but 512 would be a good start.
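Rounding a seek position down to the nearest boundary is a one-liner; a minimal sketch, assuming a 512-byte sector size:

```python
SECTOR = 512

def align_down(offset, alignment=SECTOR):
    """Round a file offset down to the nearest alignment boundary."""
    return (offset // alignment) * alignment
```

For example, `align_down(1300)` gives 1024, so a read starting there lands on a sector boundary.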

Upvotes: 0

Bill the Lizard

Reputation: 405775

I think you've got the right idea in chunking the file. You may want to read chunks in at line breaks rather than a set number of bytes, though. In your current implementation, if the token lies on a chunk boundary it could get cut in half, preventing you from finding it. The same thing could cause the dollar value after it to be cut off as well.

Upvotes: 2
