Reputation:
I am working on a project where I search through a large text file (large is relative; the file is about 1 GB) for a piece of data. I am looking for a token, and I want the dollar value that immediately follows it. For example:
this is the token 9,999,999.99
Here is how I am approaching the problem. After a little analysis, it appears that the token is usually near the end of the file, so I thought I would start searching from the end. Here is the code I have so far (VB.NET):
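Dim sToken As String = "This is a token"
Dim sr As New StreamReader(sFileName_IN)
Dim FileSize As Long = GetFileSize(sFileName_IN)
' Integer division, so 1000 blocks never reach past the start of the file;
' Math.Max guards against a zero block size on tiny files
Dim BlockSize As Integer = CInt(Math.Max(1L, FileSize \ 1000))
Dim buffer(BlockSize - 1) As Char ' VB upper bounds are inclusive
Dim Position As Long = -BlockSize
Dim sBuffer As String
Dim CurrentBlock As Integer = 0
Dim Value As Double
Dim i As Integer
Dim found As Boolean = False
While Not found AndAlso CurrentBlock < 1000
    CurrentBlock += 1
    Position = -CLng(CurrentBlock) * BlockSize ' Long arithmetic avoids overflow
    sr.BaseStream.Seek(Position, SeekOrigin.End)
    sr.DiscardBufferedData() ' drop the reader's stale buffer after seeking
    i = sr.ReadBlock(buffer, 0, BlockSize)
    sBuffer = New String(buffer, 0, i) ' only the characters actually read
    found = SearchBuffer(sBuffer, sToken, Value)
End While
sr.Close()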
GetFileSize is a function that returns the filesize. SearchBuffer is a function that will search a string for the token. I am not familiar with regular expressions but will explore it for that function.
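For the regular-expression route, a minimal sketch of what SearchBuffer might look like follows. The module name, the pattern, and the case-insensitive match are assumptions made for illustration; the pattern accepts an optional dollar sign and thousands separators, as in the 9,999,999.99 example.

```vbnet
Imports System.Globalization
Imports System.Text.RegularExpressions

Module TokenSearch
    ' Hypothetical SearchBuffer: returns True and sets value when the token
    ' is followed by an amount such as 9,999,999.99.
    Function SearchBuffer(buffer As String, token As String, ByRef value As Double) As Boolean
        ' Escape the token so it is matched literally, then capture a run of
        ' digits and commas with an optional decimal part.
        Dim pattern As String = Regex.Escape(token) & "\s*\$?(\d[\d,]*(?:\.\d+)?)"
        ' Assumption: the token's letter case may vary, so ignore case.
        Dim m As Match = Regex.Match(buffer, pattern, RegexOptions.IgnoreCase)
        If Not m.Success Then Return False
        ' Strip the thousands separators before parsing.
        Dim digits As String = m.Groups(1).Value.Replace(",", "")
        Return Double.TryParse(digits, NumberStyles.Float, CultureInfo.InvariantCulture, value)
    End Function
End Module
```

Regex.Match returns the first occurrence in the buffer; if the last one is wanted, RegexOptions.RightToLeft can be added to the options.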
Basically, I read in a small chunk of the file and search it; if I don't find the token, I load another chunk, and so on.
Am I on the right track or is there a better way?
Upvotes: 1
Views: 3338
Reputation: 1
"What if the token is broken between two chunks? Have you considered this?"
I did this just recently. Before overwriting CurrentBlock, I saved it into a PreviousBlock; if the search term was not found in the current block alone, I joined the two blocks and searched the combined text. It works well. The search term can't escape unless it is longer than a block.
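A minimal sketch of that idea, adapted to the backwards scan in the question, might look like this. It is not the code I actually used: the module and function names are hypothetical, it reads bytes through a FileStream, and it assumes a single-byte encoding (ASCII here) so that byte offsets and character offsets coincide. Instead of keeping the whole previous block it carries over only the first token.Length - 1 characters, which is all a straddling token needs.

```vbnet
Imports System.IO
Imports System.Text

Module OverlapSearch
    ' Scans the file backwards in blocks and returns the offset of the last
    ' occurrence of token, or -1 if it is absent. Assumes a non-empty token.
    Function FindLastToken(path As String, token As String, blockSize As Integer) As Long
        Dim carry As String = "" ' head of the block read previously (later in the file)
        Using fs As New FileStream(path, FileMode.Open, FileAccess.Read)
            Dim blockEnd As Long = fs.Length
            While blockEnd > 0
                Dim blockStart As Long = Math.Max(0L, blockEnd - blockSize)
                Dim count As Integer = CInt(blockEnd - blockStart)
                Dim bytes(count - 1) As Byte
                fs.Seek(blockStart, SeekOrigin.Begin)
                ' Read may return fewer bytes than requested, so loop until full.
                Dim read As Integer = 0
                While read < count
                    Dim n As Integer = fs.Read(bytes, read, count - read)
                    If n = 0 Then Exit While
                    read += n
                End While
                ' Marry the two blocks: this block plus the start of the following one.
                Dim text As String = Encoding.ASCII.GetString(bytes, 0, read) & carry
                Dim idx As Integer = text.LastIndexOf(token, StringComparison.Ordinal)
                If idx >= 0 Then Return blockStart + idx
                ' Keep just enough to complete a token that straddles the boundary.
                carry = text.Substring(0, Math.Min(text.Length, token.Length - 1))
                blockEnd = blockStart
            End While
        End Using
        Return -1
    End Function
End Module
```

If the value after the token is needed as well, the carry has to be widened by the longest value you expect, otherwise the number can be cut off even though the token is found.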
Upvotes: 0
Reputation: 411
You could always search through the file using a FileStream (or continue doing it your way, your choice). If you decide to use the FileStream approach then what you would want to do is something like this:
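Dim stream As New FileStream("something.txt", FileMode.Open, FileAccess.Read)
' Assumes the file and the search text are ASCII
Dim findBytes As Byte() = System.Text.Encoding.ASCII.GetBytes("whatever")
Dim f As Integer = 0
' remaining = Length - Position
While stream.Length - stream.Position > 0
    Dim b As Integer = stream.ReadByte()
    If b = findBytes(f) Then
        f += 1 ' VB has no ++ operator
        If f >= findBytes.Length Then
            Console.WriteLine(stream.Position) ' position just past the match
            Exit While
        End If
    Else
        ' Restart the match; the current byte may itself begin a new one.
        ' This is enough for a pattern whose first byte never recurs inside it
        ' (such as this one); other patterns need a full KMP failure table.
        f = If(b = findBytes(0), 1, 0)
    End If
End While
stream.Close()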
Note that I used a C#-to-VB converter because I don't like VB.
The basic idea applies to just searching the block for a string. It's pretty simple if you want to add reading in blocks.
Upvotes: 0
Reputation: 20780
Wait you people...
What if the token is broken between two chunks? Have you considered this?
Upvotes: 1
Reputation: 39500
If you wanted to do something more complicated but possibly faster, then you could look at reading the blocks asynchronously, so that you're searching one while the next is loading.
That way you get to perform the search at the same time as the data is chugging into memory.
I have to say though that unless your search is very expensive, disk read time will probably completely dominate this, and so complicated overlapping won't be worth the additional complexity.
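For what it's worth, a rough sketch of the overlapped version could look like the following. The names are made up for illustration, it reads forward rather than from the end to keep things short, it assumes ASCII text, and it keeps the last token.Length - 1 characters of each block so a token split across two blocks is still seen.

```vbnet
Imports System.IO
Imports System.Text
Imports System.Threading.Tasks

Module AsyncSearch
    ' Searches one block while the read of the next block is in flight.
    ' Assumes a non-empty token.
    Async Function ContainsTokenAsync(path As String, token As String, blockSize As Integer) As Task(Of Boolean)
        ' Two buffers: one being searched, one being filled.
        Dim buffers As Byte()() = {New Byte(blockSize - 1) {}, New Byte(blockSize - 1) {}}
        Dim carry As String = ""
        ' The final True asks for asynchronous file I/O.
        Using fs As New FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, 4096, True)
            Dim current As Integer = 0
            Dim pending As Task(Of Integer) = fs.ReadAsync(buffers(current), 0, blockSize)
            While True
                Dim n As Integer = Await pending
                If n = 0 Then Return False ' end of file, nothing in flight
                ' Start loading the next block before searching this one.
                Dim nextIndex As Integer = 1 - current
                pending = fs.ReadAsync(buffers(nextIndex), 0, blockSize)
                Dim text As String = carry & Encoding.ASCII.GetString(buffers(current), 0, n)
                If text.Contains(token) Then
                    Await pending ' let the in-flight read finish before the stream is disposed
                    Return True
                End If
                ' Keep the tail so a token split across blocks is not missed.
                carry = text.Substring(Math.Max(0, text.Length - (token.Length - 1)))
                current = nextIndex
            End While
        End Using
        Return False
    End Function
End Module
```

Even in this small form it is noticeably more involved than a plain loop, which is the trade-off described above.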
Upvotes: 0
Reputation: 39500
If you're going to use chunks, it would be wise to use blocks which are multiples of 512 bytes long, and seek on a 512 byte alignment, because that will tend to be more efficient in accessing the disk (which ultimately will be in 512 byte blocks).
There may be other granularities even better than that, but 512 would be a good start.
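The arithmetic for this is small. Here is a sketch with hypothetical helper names; SectorSize is a constant you can change if another granularity (4096, say) suits your disk better, and offsets are assumed to be non-negative.

```vbnet
Module Alignment
    Const SectorSize As Long = 512

    ' Rounds a block size up to the next multiple of the sector size.
    Function AlignUp(size As Long) As Long
        Return ((size + SectorSize - 1) \ SectorSize) * SectorSize
    End Function

    ' Rounds a file offset down to the start of its sector.
    Function AlignDown(offset As Long) As Long
        Return (offset \ SectorSize) * SectorSize
    End Function
End Module
```

You would pass the block size through AlignUp once, and pass each seek offset through AlignDown before reading.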
Upvotes: 0
Reputation: 405775
I think you've got the right idea in chunking the file. You may want to read chunks in at line breaks rather than a set number of bytes, though. In your current implementation, if the token straddles a block boundary it gets cut in half, which prevents you from finding it. The same thing could cut off the dollar value as well.
Upvotes: 2