Bit Manipulator

Reputation: 358

Read n lines of a file into a memory buffer

I have to read and process a 50 GB file, and I want to do it chunk by chunk, say with a buffer size of 5 GB. The problem is that each row has a different format with a different number of parameters. A sample snippet:

4 A 5 7
1 2 B 7 9 10
1 3 B 14 755 9874
5 A 2 7
...

So I can't simply call fread(. . .) with a read size of 5 GB, as that would probably end in the middle of a number. Instead, I want to read the maximum number of whole lines from the file into the buffer, ending at a '\n'.

A possible solution would be to read, say, 1000 bytes less than 5 GB on the first read, then keep re-reading the file, seeking back to the start each time and increasing the read size by one byte until the last byte read is '\n'. But this takes many more reads, so I wanted to know whether there is a better solution.


EDIT:

I use this simple code:

#include <iostream> 
#include <cstdio> 
using namespace std; 

int main()
{
    FILE* fp = fopen("outit", "r");
    char *s = new char[1000];
    fread(s,1,1000,fp);
    cout<<s;
} 

A small sample file has only these lines:

this is a line
this is another line
again another one
more another

But still, the output is:

this is a line
this is another line 
again another one 
more anotheram Files (x86)\CodeBlocks\MinGW\bin;C
:\WINDOWS\system32;C:\WINDO WS;C:\WINDOW
S\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Progr am Files\Microsoft SQL Server\110\Tools\Binn\;D:\Program Files\MATLAB\R2012b\run time\win64;D:\Program Files\MATLAB\R2012b\bin;C:\Program Files (x86)\Microsoft A SP.NET\ASP.NET Web Pages\v1.0\;C:\Program Files (x86)\Windows Kits\8.0\Windows P erformance Toolkit\;C:\Program Files (x86)\MySQL\MySQL Utilities 1.3.4\

What is that garbage, and why does it appear?

Upvotes: 0

Views: 82

Answers (1)

jrok

Reputation: 55395

  • Read a fixed amount of data in memory.
  • Find the last '\n' (starting the search from behind). This will be the logical end of your buffer.
  • Remember its position so you can adjust the next read to start right after it.
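
The steps above can be sketched roughly as follows. This is only one possible reading of them, not necessarily what the answer had in mind: instead of seeking back, the bytes after the last '\n' are moved to the front of the buffer and the next fread continues behind them. The names for_each_line_chunk and on_chunk are made up for this illustration, and the sketch assumes that every line fits into the buffer.

```cpp
#include <cstdio>
#include <cstring>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): reads fp in chunks of at most
// buf_size bytes and passes each chunk, cut at its last '\n', to on_chunk.
// Returns false if a single line does not fit into the buffer.
bool for_each_line_chunk(
    std::FILE* fp, std::size_t buf_size,
    const std::function<void(const char*, std::size_t)>& on_chunk)
{
    std::vector<char> buf(buf_size);
    std::size_t carried = 0;  // bytes of an incomplete line kept from the last read

    for (;;) {
        // Step 1: read a fixed amount of data behind the carried-over bytes.
        std::size_t got = std::fread(buf.data() + carried, 1, buf_size - carried, fp);
        std::size_t filled = carried + got;
        if (filled == 0) return true;  // nothing left to process

        if (got == 0) {
            // End of file (or read error): hand over the final line as is,
            // even though it has no trailing '\n'.
            on_chunk(buf.data(), filled);
            return true;
        }

        // Step 2: search backwards for the last '\n'.
        std::size_t end = filled;
        while (end > 0 && buf[end - 1] != '\n') --end;

        if (end == 0) {
            if (filled == buf_size) return false;  // line longer than the buffer
            carried = filled;                      // no newline yet, keep reading
            continue;
        }

        on_chunk(buf.data(), end);

        // Step 3: keep the partial line so the next read continues after it.
        carried = filled - end;
        std::memmove(buf.data(), buf.data() + end, carried);
    }
}
```

A call might look like for_each_line_chunk(fp, buffer_size, callback), where the callback receives a pointer and a length, and every chunk except possibly the last ends in '\n'. Carrying the tail over avoids fseek, whose offsets can be unreliable on files opened in text mode on Windows; opening the file with "rb" is probably the safer choice for a file this large.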

Edit:

The garbage in the output appears because the buffer is initially uninitialized and contains garbage, and because there is no terminating NUL character to tell cout when to stop printing.

When you call fread and don't know exactly how much input you'll get, you need to check its return value, which tells you how many characters it actually read. You can use it to set the NUL terminator accordingly:

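size_t n = fread(s, 1, 999, fp);  // read at most 999 bytes so the terminator fits in the 1000-byte buffer
s[n] = '\0';                      // terminate right after the last byte actually read
cout << s;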
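
As an alternative sketch, the byte count returned by fread can be used directly, so that no NUL terminator is needed at all. The helper name read_up_to is made up for this illustration and is not part of the answer above.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: reads at most max_bytes from fp and returns exactly
// the bytes that fread reported, so no NUL terminator is involved.
std::string read_up_to(std::FILE* fp, std::size_t max_bytes)
{
    std::vector<char> buf(max_bytes);
    std::size_t n = std::fread(buf.data(), 1, buf.size(), fp);
    return std::string(buf.data(), n);  // copies only the n bytes actually read
}
```

With this, cout << read_up_to(fp, 1000); prints only what was read from the file, because the std::string carries its own length.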

Upvotes: 2
