zebediah49

Reputation: 7611

Awk skip field-separation for greater speed

I have a reasonably large data set (10K files, each with 20K lines). I need to swap file and line (giving me 20K files, each with 10K lines).

I had a solution that combined it all into one massive table and then extracted the columns with cut, but cut was taking too long (scanning through a 4 GB file 10K times isn't exactly fast, even if the file is sitting in cache).

So I wrote a (surprisingly simple) once-through in awk:

awk '{ print >> "times/"FNR".txt" }' posns/*

This does the job, but is also rather slow (about 10s per input file). My guess is that it is doing field separation, despite the fact that I don't need that at all. Is there a way to disable that feature to speed it up, or am I going to have to write up a solution in yet another language?

If it helps, while I'd prefer a general solution, each line in each file is of the form %d %lf %lf, so lines will be at most 21 bytes in this case (the floats are all less than 100, and the integer is 0 or 1).

Upvotes: 1

Views: 301

Answers (4)

nullrevolution

Reputation: 4137

I don't know whether this is faster than awk or not, but here's a Perl script that will accomplish the task:

#!/usr/bin/perl

use strict;
use warnings;

foreach my $file (@ARGV) {

    # read each input file and append its Nth line to times/N.txt
    open(my $in, '<', $file) or die "cannot open $file: $!";
    my $line = 0;

    while (<$in>) {
        $line++;
        open(my $out, '>>', "times/$line.txt")
            or die "cannot open times/$line.txt: $!";
        print $out $_;
        close($out);
    }

    close($in);
}
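
To run it, something like this should do (the script name here is just an example, and the times directory must already exist):

mkdir -p times                # output directory the script writes into
perl transpose.pl posns/*     # "transpose.pl" is whatever you saved the script as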

Upvotes: 1

zebediah49

Reputation: 7611

Eventually I gave up on the pretty shell method and wrote another version in C. It's sad and it's not pretty, but it's more than three orders of magnitude faster (a total run time of 43 seconds, compared to an estimated 28 hours for the awk method, given pre-cached data). It requires raising ulimit to allow enough open files, and if your lines are longer than LINE_LENGTH it will not work correctly.

Still, it runs 2300 times faster than the next best solution.

If someone stumbles upon this looking to do this task, this will do it. Just be careful and check that it actually worked.

    #include <stdio.h>
    #include <stdlib.h>

    #define LINE_LENGTH 1024

    int main(int argc, char* argv[]) {
            int fn;
            int ln;
            char read[LINE_LENGTH];

            int fmax=10;
            int ftot=0;
            FILE** files=malloc(fmax*sizeof(FILE*));
            char fname[255];
            printf("%d arguments\n", argc);

            /* First input file: open one output file per line and seed it. */
            printf("opening %s\n",argv[1]);
            FILE* open = fopen(argv[1],"r");
            if(open==NULL) {
                    fprintf(stderr,"Failed to open %s\n",argv[1]);
                    return 1;
            }

            for(ln=0;fgets(read,LINE_LENGTH,open); ln++) {
                    if(ln==fmax) {
                            printf("%d has reached %d; reallocing\n",ln,fmax);
                            fmax*=2;
                            files=realloc(files,fmax*sizeof(FILE*));
                    }
                    sprintf(fname, "times/%09d.txt",ln);
                    files[ln]=fopen(fname,"w");
                    if(files[ln]==NULL) {
                            fprintf(stderr,"Failed at opening file number %d\n",ln);
                            return 1;
                    }
                    fprintf(files[ln],"%s",read);
            }
            ftot=ln;
            fclose(open);

            /* Remaining input files: append line ln to output file ln.
               Assumes every input has the same number of lines; any
               lines beyond ftot are silently dropped. */
            for(fn=2;fn<argc;fn++) {
                    printf("working on file %d\n",fn);
                    open=fopen(argv[fn],"r");
                    if(open==NULL) {
                            fprintf(stderr,"Failed to open %s\n",argv[fn]);
                            return 1;
                    }
                    for(ln=0;ln<ftot && fgets(read,LINE_LENGTH,open); ln++) {
                            fprintf(files[ln],"%s",read);
                    }
                    fclose(open);
            }

            /* Close all per-line output files. */
            for(ln=0;ln<ftot;ln++) {
                    fclose(files[ln]);
            }
            return 0;
    }
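
For completeness, raising the file limit and running it might look roughly like this (bash syntax; transpose.c and transpose are just placeholder names I'm using here, and the limit has to exceed the number of lines per input file):

    # raise the per-process open-file limit for this shell session
    # (needs a sufficiently high hard limit to succeed)
    ulimit -n 25000
    gcc -O2 -o transpose transpose.c
    mkdir -p times
    ./transpose posns/*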

Upvotes: 0

Ed Morton

Reputation: 203985

You could try a different awk. I hear mawk is faster than other awks, and GNU awk has had some performance improvements that mean it might run faster than whatever you're using. If you set your field separator to your record separator there will only be one field per line, so if you're right about field splitting being the issue, that should speed it up. Also, you're using the wrong redirection operator: in awk it should be ">" not ">>" (">" only truncates the file the first time it's opened during the run and appends on every print after that, so nothing gets lost). And since string concatenation is slow, I'd recommend just printing to numbered files and renaming them all afterwards.

Something like this:

cd times
awk -F'\n' '{ print > FNR }' ../posns/*
for f in *
do
    mv -- "$f" "${f}.txt"
done
cd ..

You might want to test it on a dummy directory first.

Regarding other comments in this thread suggesting that the real issue might be keeping so many files open simultaneously: can you process the inputs in sub-groups based on some pattern in the file names? For example, if your posns files all started with a digit:

cd times
rm -f *
for ((i=0; i<=9; i++))
do
   awk -F'\n' '{ print >> FNR }' ../posns/"$i"*
done
for f in *
do
   mv -- "$f" "${f}.txt"
done
cd ..

Note that in that case you would need to zap your output files first. I'm sure there's a better way to group your files than that but you'd need to tell us if there's a naming convention.

Upvotes: 1

sampson-chen

Reputation: 47317

This sounds like a perfect job for split ;)

find posns -type f -exec split -l 10000 {} \;

You can play with the -a and -d options to customize the result file suffixes (see the sketch after the breakdown below).

Explanation:

  • find posns -type f: find all files (recursively) in the directory posns
  • -exec ... \; : for each result found, do the following command ...
  • split -l 10000 {}: the {} is just where the result from find is substituted into when used in conjunction with -exec. split -l 10000 splits the input file into chunks of at most 10k lines each.
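
For example, with GNU split the suffix options might look like this (the input file and the times/part_ prefix are just illustrative):

split -l 10000 -d -a 5 posns/somefile times/part_

Here -d switches to numeric suffixes (00000, 00001, ...) instead of the default aa, ab, ..., and -a 5 sets the suffix width to five characters.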

Upvotes: 0
