Reputation: 35246
I quickly wrote a C program extracting the i-th line of a set of gzipped files (containing about 500,000 lines). Here is my C program:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <zlib.h>
/* compilation:
gcc -o linesbyindex -Wall -O3 linesbyindex.c -lz
*/
#define MY_BUFFER_SIZE 10000000
static void extract(long int index,const char* filename)
{
char buffer[MY_BUFFER_SIZE];
long int curr=1;
gzFile in=gzopen (filename, "rb");
if(in==NULL)
{
fprintf(stderr,"Cannot open \"%s\" %s.\n",filename,strerror(errno));
exit(EXIT_FAILURE); }
while(gzread(in,buffer,MY_BUFFER_SIZE)!=-1 && curr<=index)
{
char* p=buffer;
while(*p!=0)
{
if(curr==index)
{
fputc(*p,stdout);
}
if(*p=='\n')
{
++curr;
if(curr>index) break;
}
p++;
}
}
gzclose(in);
if(curr<index)
{
fprintf(stderr,"Not enough lines in %s (%ld)\n",filename,curr);
}
}
int main(int argc,char** argv)
{
int optind=2;
char* p2;
long int count=0;
if(argc<3)
{
fprintf(stderr,"Usage: %s (count) files...\n",argv[0]);
return EXIT_FAILURE;
}
count=strtol(argv[1],&p2,10);
if(count<1 || *p2!=0)
{
fprintf(stderr,"bad number %s\n",argv[1]);
return EXIT_SUCCESS;
}
while(optind< argc)
{
extract(count,argv[optind]);
++optind;
}
return EXIT_SUCCESS;
}
As a test, I wrote the following equivalent code in java:
import java.io.*;
import java.util.zip.GZIPInputStream;
public class GetLineByIndex{
private int index;
public GetLineByIndex(int count){
this.index=count;
}
private String extract(File file) throws IOException
{
long curr=1;
byte buffer[]=new byte[2048];
StringBuilder line=null;
InputStream in=null;
if(file.getName().toLowerCase().endsWith(".gz")){
in= (new GZIPInputStream(new FileInputStream(file)));
}else{
in= (new FileInputStream(file));
}
int nRead=0;
while((nRead=in.read(buffer))!=-1)
{
int i=0;
while(i<nRead)
{
if(buffer[i]=='\n')
{
++curr;
if(curr>this.index) break;
}
else if(curr==this.index)
{
if(line==null) line=new StringBuilder(500);
line.append((char)buffer[i]);
}
i++;
}
if(curr>this.index) break;
}
in.close();
return (line==null?null:line.toString());
}
public static void main(String args[]) throws Exception{
int optind=1;
if(args.length<2){
System.err.println("Usage: program (count) files...\n");
return;
}
GetLineByIndex app=new GetLineByIndex(Integer.parseInt(args[0]));
while(optind < args.length)
{
String line=app.extract(new File(args[optind]));
if(line==null)
{
System.err.println("Not enough lines in "+args[optind]);
}
else
{
System.out.println(line);
}
++optind;
}
return;
}
}
It happens that the java program was much faster (~1'45'') to fetch a large index than the C program (~2'15'') on the same machine (I ran that test several times).
How can I explain that difference ?
Upvotes: 10
Views: 1663
Reputation:
The most likely explanation for the Java version to be faster than the C version is that the C version is incorrect.
After fixing the C version, I obtained the following results (contradicting your claim that Java is faster than C):
Java 1.7 -client: 65 milliseconds (after JVM warmed up)
Java 1.7 -server: 82 milliseconds (after JVM warmed up)
gcc -O3: 37 milliseconds
The task was to print the 200000-th line from file words.gz
. File words.gz
was generated by gzipping /usr/share/dict/words
.
...
static char buffer[MY_BUFFER_SIZE];
...
ssize_t len;
while((len=gzread(in,buffer,MY_BUFFER_SIZE)) > 0 && curr<=index)
{
char* p=buffer;
char* endp=buffer+len;
while(p < endp)
{
...
Upvotes: 22
Reputation: 533510
Very large buffers can be slower. I would suggest you make the buffer size the same. i.e. both 2 or 8 KB
Upvotes: 0
Reputation: 2866
I do not have deeper knowledge about what optimizations the compiler performs, but I guess this what makes the difference between your programs. Microbenchmarks like this very, very, very hard to get right and meaningful. Here's an article by Brian Goetz that elaborates on this: http://www.ibm.com/developerworks/java/library/j-jtp02225/index.html
Upvotes: 0
Reputation: 94329
Well your programs are doing different things. I didn't profile your program, but from looking at your code I suspect this difference:
For building the line, you use this in Java:
if(curr==this.index)
{
if(line==null) line=new StringBuilder(500);
line.append((char)buffer[i]);
}
And this in C:
if(curr==index)
{
fputc(*p,stdout);
}
I.e. you're printing one character at a time to stdout. Which is buffere, by default, but I suspect it's still slower than the 500 character buffer you use in Java.
Upvotes: 12
Reputation: 698
Because fputc() isn't very fast and you're adding stuf char-by-char in your output file.
calling fputc_unlocked or rather delimiting the stuff you want to add and call fwrite() should be faster.
Upvotes: 15