Reputation: 27
I'm relatively new to C/C++ and am currently attempting to use it to parse large, formatted text files containing numerical data into arrays, so that I can work with them using the LAPACK library.
The text files I am parsing have a very simple format: a 5-line header followed by 50 values, then the next 5-line header and 50 values, repeated approximately 1 million times:
5 line header
1.000000E+00 2.532093E+02
2.000000E+00 7.372978E+02
3.000000E+00 5.690047E+02
My current approach is to use the fscanf function, but I am getting strange results. I'm currently using a very naive approach to skip over the lines containing the header text, but I fear this might be the problem. Or perhaps my use of fscanf is flawed. Here is what I have so far:
int main() {
FILE *ifp;
FILE *ofp;
char mystring[500];
int i,j,n;
//ofp = fopen("newfile.txt","w");
ifp = fopen("results","r");
if (ifp != NULL) {
//Test with 10 result blocks each containing 50 frequency values
float** A = fmatrix(50,10);
for (j=0; j<10; j++) {
//fscanf(ifp, "%*[^\n]\n", NULL);
//fscanf(ifp, "%*[^\n]\n", NULL);
//fscanf(ifp, "%*[^\n]\n", NULL);
//fscanf(ifp, "%*[^\n]\n", NULL);
//fscanf(ifp, "%*[^\n]\n", NULL);
//using fgets w/ printf to see contents of "discarded" lines
fgets(mystring,500,ifp); printf("%s",mystring);
fgets(mystring,500,ifp); printf("%s",mystring);
fgets(mystring,500,ifp); printf("%s",mystring);
fgets(mystring,500,ifp); printf("%s",mystring);
fgets(mystring,500,ifp); printf("%s",mystring);
for (i=0; i<50; i++) {
//skip over first float, store the next float into A[i][j]
n=fscanf(ifp," %*e %E", &A[i][j]);
printf("A[%i][%i]: %E, %i\n",i,j,A[i][j],n);
}
}
}
return 0;
}
float** fmatrix(int m, int n) {
//Return an m x n Matrix
int i;
float** A = (float**)malloc(m*sizeof(float*));
A[0] = (float*)malloc(m*n*sizeof(float));
for (i = 1; i < m; i++) {
A[i] = A[i-1]+n;
}
return A;
}
The result is curious. I get a 50-component column vector that matches the results file, then 50 zeros as the second column vector, and then a third column vector that corresponds to the second set of values in my results file, and so on. That is, I get alternating columns of zeros and non-zero values in my matrix. I later inserted the fgets/printf lines to see what was going on, and to my surprise, some of the lines being discarded contained numeric data, not just header text.
Does anyone have an idea what is, or could be, wrong here? Since this is such a simple format, I really don't know where the problem could lie. A related question: what is the preferred method for skipping over header text? The method I am using is practically single-use, since any change in the header or file format would render the code worthless. Should I perhaps use fgets to check whether each line matches the data part of the file, and skip any line that does not match the 2-column pattern?
A final question regarding performance: Bugs aside, is fscanf the best way to proceed here? As I mentioned earlier, these files can sometimes have sizes of several hundred million lines, and I'm not at all well enough versed in C/C++ to know if there are faster ways of reading such large amounts of lines into matrices / vectors.
I hope I have provided enough information here to make my question clear. If need be, I can post excerpts of my results files here.
Upvotes: 0
Views: 1106
Reputation: 753755
Because you are not consistently using fgets(), you read 5 header lines OK, then 50 numbers, but the last number leaves the newline at the end of line 55 ready to be read by the first fgets() of the next block of header lines. So the second block of header reading reads the newline (only), then 4 header lines, and then the data scanning tries to read the last line of the header as a number and (probably) fails.
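The leftover newline can be reproduced in isolation. The sketch below is illustrative only: the function names leftover_after_fscanf and demo_leftover_newline are made up for this example, and it uses a temporary file from tmpfile() in place of the real results file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reads one data value with fscanf(), then one "line" with fgets().
 * Returns the length of the string fgets() stored, or -1 on failure.
 * (Hypothetical helper, not part of the code in the question.) */
int leftover_after_fscanf(FILE *fp, char *line, int size)
{
    float value;

    assert(fp != NULL && line != NULL && size > 0);

    /* Same format as the question: skip the first number, keep the second.
     * fscanf() stops right after the second number, before the newline. */
    if (fscanf(fp, " %*e %e", &value) != 1)
        return -1;

    /* fgets() now reads whatever is left of the current line. */
    if (fgets(line, size, fp) == NULL)
        return -1;

    return (int)strlen(line);
}

/* Builds a tiny file (one data line, one header line) and runs the test. */
int demo_leftover_newline(char *line, int size)
{
    FILE *fp = tmpfile();
    int len;

    if (fp == NULL)
        return -1;

    fputs("1.0E+00 2.5E+02\nHeader line\n", fp);
    rewind(fp);

    len = leftover_after_fscanf(fp, line, size);
    fclose(fp);
    return len;
}
```

Here demo_leftover_newline() should return 1, with line holding just "\n": fgets() consumes only the newline that fscanf() left behind, and the header line is still unread.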
Always check the return value from every input function (even if it seems to make life painful).
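As a rough illustration of what checking the return value might look like for the fscanf() call in the question, here is a minimal sketch. The enum read_status and the function read_second_column are hypothetical names introduced for this example only.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical status codes for this example. */
enum read_status { READ_OK, READ_EOF, READ_BAD_FORMAT };

/* Skips the first number on the line and stores the second in *out.
 * The return value of fscanf() tells us which of three things happened:
 *   1   -> one value was assigned (the suppressed %*e is not counted)
 *   EOF -> the input ended before anything could be converted
 *   0   -> the input did not look like a number (e.g. header text) */
enum read_status read_second_column(FILE *fp, float *out)
{
    int n;

    assert(fp != NULL && out != NULL);

    n = fscanf(fp, " %*e %e", out);
    if (n == 1)
        return READ_OK;
    if (n == EOF)
        return READ_EOF;
    return READ_BAD_FORMAT;
}
```

With a check like this, the inner loop can stop at the first READ_BAD_FORMAT instead of silently leaving elements of A unset, which is what produced the columns of zeros.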
And, I suggest, use fgets() to read each line. Skip the heading lines; use sscanf() to convert the data on the data lines. But check both fgets() and sscanf() for the correct return values.
There are other functions to convert strings to numbers; strtod() could be used.
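For instance, a line that has already been read with fgets() could be converted with strtod() roughly as follows. This is only a sketch: parse_data_line is a hypothetical helper, it assumes a data line is exactly two numbers separated by whitespace, and it returns a double, so the result would need converting if the matrix stays float. Because it reports whether the line matched, it could also be used to skip header lines by content, though a header that happens to consist of two numbers would be mistaken for data.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 1 and stores the second column in *value if the line has the
 * form "number number" (optionally followed by whitespace or a newline).
 * Returns 0 for anything else, such as a line of header text.
 * (Hypothetical helper for illustration.) */
int parse_data_line(const char *line, double *value)
{
    char *end;
    const char *second_start;
    double second;

    assert(line != NULL && value != NULL);

    /* First column: converted only to find where it ends. */
    (void)strtod(line, &end);
    if (end == line)
        return 0;               /* no number at the start of the line */

    /* Second column: strtod() skips the separating whitespace itself. */
    second_start = end;
    second = strtod(second_start, &end);
    if (end == second_start)
        return 0;               /* no second number */

    /* Accept only whitespace after the second number. */
    while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n')
        end++;
    if (*end != '\0')
        return 0;

    *value = second;
    return 1;
}
```

Unlike sscanf(), strtod() reports exactly where conversion stopped through its second argument, which is what makes the trailing-text check possible.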
Here's some working code, cut down to work on 5 blocks of data with 10 lines per set (and still 5 header lines):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
extern float **fmatrix(int m, int n);
enum { HDRS = 5, ROWS = 10, COLS = 5 };
static int read_line(FILE *ifp, char *buffer, size_t buflen)
{
if (fgets(buffer, buflen, ifp) == 0)
{
fprintf(stderr, "EOF\n");
return 0;
}
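// Strip the trailing newline, if present; unlike buffer[len-1], this is
// safe for an empty string or a line that was too long to hold a newline
buffer[strcspn(buffer, "\n")] = '\0';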
printf("[[%s]]\n", buffer);
return 1;
}
int main(void)
{
FILE *ifp;
char mystring[500];
int i, j, n;
ifp = stdin;
if (ifp != NULL)
{
// Test with COLS result blocks each containing ROWS frequency values
float **A = fmatrix(ROWS, COLS);
for (j = 0; j < COLS; j++)
{
// using fgets w/ printf to see contents of "discarded" lines
for (i = 0; i < HDRS; i++)
{
if (read_line(ifp, mystring, sizeof(mystring)) == 0)
break;
}
for (i = 0; i < ROWS; i++)
{
// skip over first float, store the next float into A[i][j]
if (read_line(ifp, mystring, sizeof(mystring)) == 0)
break;
if ((n = sscanf(mystring, " %*e %E", &A[i][j])) != 1)
break;
printf("A[%i][%i]: %E, %i\n", i, j, A[i][j], n);
}
}
for (i = 0; i < ROWS; i++)
{
for (j = 0; j < COLS; j++)
printf("%8.3f", A[i][j]);
putchar('\n');
}
}
return 0;
}
float **fmatrix(int m, int n)
{
// Return an m x n Matrix
int i;
float **A = (float **)malloc(m * sizeof(float *));
A[0] = (float *)malloc(m * n * sizeof(float));
for (i = 1; i < m; i++)
{
A[i] = A[i - 1] + n;
}
return A;
}
Smaller data file:
Line 1 of heading 1
Line 2 of heading 1
Line 3 of heading 1
Line 4 of heading 1
Line 5 of heading 1
18.1815 56.4442
12.0478 15.5530
47.7793 44.5291
30.8319 78.9396
53.5651 28.1290
74.9131 90.5912
34.9319 10.5254
69.7780 56.8633
92.5056 11.8101
82.0158 31.7586
Line 1 of heading 2
Line 2 of heading 2
Line 3 of heading 2
Line 4 of heading 2
Line 5 of heading 2
118.15 564.442
104.78 155.530
477.93 445.291
383.19 789.396
556.51 281.290
791.31 905.912
393.19 105.254
677.80 568.633
950.56 118.101
801.58 317.586
Line 1 of heading 3
Line 2 of heading 3
Line 3 of heading 3
Line 4 of heading 3
Line 5 of heading 3
18.1815 36.4442
12.0478 35.5530
47.7793 34.5291
30.8319 38.9396
53.5651 38.1290
74.9131 30.5912
34.9319 30.5254
69.7780 36.8633
92.5056 31.8101
82.0158 31.7586
Line 1 of heading 4
Line 2 of heading 4
Line 3 of heading 4
Line 4 of heading 4
Line 5 of heading 4
118.15 464.442
104.78 455.530
477.93 445.291
383.19 489.396
556.51 481.290
791.31 405.912
393.19 405.254
677.80 468.633
950.56 418.101
801.58 417.586
Line 1 of heading 5
Line 2 of heading 5
Line 3 of heading 5
Line 4 of heading 5
Line 5 of heading 5
118.15 564.442
104.78 555.530
477.93 545.291
383.19 589.396
556.51 581.290
791.31 505.912
393.19 505.254
677.80 568.633
950.56 518.101
801.58 517.586
Note that the block of 20 random numbers was edited in different ways to get different numbers in each block. There's a strong genetic resemblance between the values in the blocks, though.
Result of running the program on the data file:
[[Line 1 of heading 1]]
[[Line 2 of heading 1]]
[[Line 3 of heading 1]]
[[Line 4 of heading 1]]
[[Line 5 of heading 1]]
[[18.1815 56.4442]]
A[0][0]: 5.644420E+01, 1
[[12.0478 15.5530]]
A[1][0]: 1.555300E+01, 1
[[47.7793 44.5291]]
A[2][0]: 4.452910E+01, 1
[[30.8319 78.9396]]
A[3][0]: 7.893960E+01, 1
[[53.5651 28.1290]]
A[4][0]: 2.812900E+01, 1
[[74.9131 90.5912]]
A[5][0]: 9.059120E+01, 1
[[34.9319 10.5254]]
A[6][0]: 1.052540E+01, 1
[[69.7780 56.8633]]
A[7][0]: 5.686330E+01, 1
[[92.5056 11.8101]]
A[8][0]: 1.181010E+01, 1
[[82.0158 31.7586]]
A[9][0]: 3.175860E+01, 1
[[Line 1 of heading 2]]
[[Line 2 of heading 2]]
[[Line 3 of heading 2]]
[[Line 4 of heading 2]]
[[Line 5 of heading 2]]
[[118.15 564.442]]
A[0][1]: 5.644420E+02, 1
[[104.78 155.530]]
A[1][1]: 1.555300E+02, 1
[[477.93 445.291]]
A[2][1]: 4.452910E+02, 1
[[383.19 789.396]]
A[3][1]: 7.893960E+02, 1
[[556.51 281.290]]
A[4][1]: 2.812900E+02, 1
[[791.31 905.912]]
A[5][1]: 9.059120E+02, 1
[[393.19 105.254]]
A[6][1]: 1.052540E+02, 1
[[677.80 568.633]]
A[7][1]: 5.686330E+02, 1
[[950.56 118.101]]
A[8][1]: 1.181010E+02, 1
[[801.58 317.586]]
A[9][1]: 3.175860E+02, 1
[[Line 1 of heading 3]]
[[Line 2 of heading 3]]
[[Line 3 of heading 3]]
[[Line 4 of heading 3]]
[[Line 5 of heading 3]]
[[18.1815 36.4442]]
A[0][2]: 3.644420E+01, 1
[[12.0478 35.5530]]
A[1][2]: 3.555300E+01, 1
[[47.7793 34.5291]]
A[2][2]: 3.452910E+01, 1
[[30.8319 38.9396]]
A[3][2]: 3.893960E+01, 1
[[53.5651 38.1290]]
A[4][2]: 3.812900E+01, 1
[[74.9131 30.5912]]
A[5][2]: 3.059120E+01, 1
[[34.9319 30.5254]]
A[6][2]: 3.052540E+01, 1
[[69.7780 36.8633]]
A[7][2]: 3.686330E+01, 1
[[92.5056 31.8101]]
A[8][2]: 3.181010E+01, 1
[[82.0158 31.7586]]
A[9][2]: 3.175860E+01, 1
[[Line 1 of heading 4]]
[[Line 2 of heading 4]]
[[Line 3 of heading 4]]
[[Line 4 of heading 4]]
[[Line 5 of heading 4]]
[[118.15 464.442]]
A[0][3]: 4.644420E+02, 1
[[104.78 455.530]]
A[1][3]: 4.555300E+02, 1
[[477.93 445.291]]
A[2][3]: 4.452910E+02, 1
[[383.19 489.396]]
A[3][3]: 4.893960E+02, 1
[[556.51 481.290]]
A[4][3]: 4.812900E+02, 1
[[791.31 405.912]]
A[5][3]: 4.059120E+02, 1
[[393.19 405.254]]
A[6][3]: 4.052540E+02, 1
[[677.80 468.633]]
A[7][3]: 4.686330E+02, 1
[[950.56 418.101]]
A[8][3]: 4.181010E+02, 1
[[801.58 417.586]]
A[9][3]: 4.175860E+02, 1
[[Line 1 of heading 5]]
[[Line 2 of heading 5]]
[[Line 3 of heading 5]]
[[Line 4 of heading 5]]
[[Line 5 of heading 5]]
[[118.15 564.442]]
A[0][4]: 5.644420E+02, 1
[[104.78 555.530]]
A[1][4]: 5.555300E+02, 1
[[477.93 545.291]]
A[2][4]: 5.452910E+02, 1
[[383.19 589.396]]
A[3][4]: 5.893960E+02, 1
[[556.51 581.290]]
A[4][4]: 5.812900E+02, 1
[[791.31 505.912]]
A[5][4]: 5.059120E+02, 1
[[393.19 505.254]]
A[6][4]: 5.052540E+02, 1
[[677.80 568.633]]
A[7][4]: 5.686330E+02, 1
[[950.56 518.101]]
A[8][4]: 5.181010E+02, 1
[[801.58 517.586]]
A[9][4]: 5.175860E+02, 1
56.444 564.442 36.444 464.442 564.442
15.553 155.530 35.553 455.530 555.530
44.529 445.291 34.529 445.291 545.291
78.940 789.396 38.940 489.396 589.396
28.129 281.290 38.129 481.290 581.290
90.591 905.912 30.591 405.912 505.912
10.525 105.254 30.525 405.254 505.254
56.863 568.633 36.863 468.633 568.633
11.810 118.101 31.810 418.101 518.101
31.759 317.586 31.759 417.586 517.586
Upvotes: 1