dawg

Reputation: 103754

awk super slow processing many rows but not many columns

While looking into this question, the challenge was to take this matrix:

4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w

And turn it into:

4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2       # top of next 2 columns...
6 7
4 2
... each N elements from each row of the matrix -- in this example, N=2...
3 4
4 1
d f
5 9
q w      # last pair is the bottom of the last 2 columns

The OP stated the input was 'much bigger' than the example without specifying the shape of the actual input (millions of rows? millions of columns? or both?).

I assumed (mistakenly) that the file had millions of rows; it was later specified to have millions of columns.

BUT the interesting thing is that most of the awks written ran at perfectly acceptable speed IF the shape of the data was millions of columns.

Example: @glennjackman posted a perfectly usable awk, so long as the long end was in columns, not in rows.

You can use his Perl script to generate an example matrix of rows x columns:

perl -E '
my $cols = 2**20;    # 1,048,576 columns - the long end
my $rows = 2**3;     # 8 rows
my @alphabet = ( "a".."z", 0..9 );   # double quotes: the script is inside shell single quotes
my $size = scalar @alphabet;

for ($r=1; $r <= $rows; $r++) {
    for ($c = 1; $c <= $cols; $c++) {
        my $idx = int rand $size;
        printf "%s ", $alphabet[$idx];
    }
    printf "\n";
}' >file
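For readers without Perl handy, here is a rough Python equivalent of that generator (a sketch: same 36-character alphabet and dimensions, writing to `file`; it joins fields with single spaces, so it omits the Perl version's trailing space per line):

```python
import random
import string

cols = 2**20  # 1,048,576 columns - the long end
rows = 2**3   # 8 rows
alphabet = list(string.ascii_lowercase) + [str(d) for d in range(10)]

with open("file", "w") as f:
    for _ in range(rows):
        # one random single-character field per column, space-separated
        f.write(" ".join(random.choice(alphabet) for _ in range(cols)))
        f.write("\n")
```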

Here are some candidate scripts that turn file (from that Perl script) into the output of 2 columns taken from the front of each row:

This Python script is the speed champ regardless of the shape of the input:

$ cat col.py
import sys

cols=int(sys.argv[2])
offset=0
delim="\t"

with open(sys.argv[1], "r") as f:
    dat=[line.split() for line in f]

while offset<=len(dat[0])-cols:
    for sl in dat:
        print(delim.join(sl[offset:offset+cols]))
    offset+=cols

Here is a Perl script (run with perl -lan) that is also quick enough regardless of the shape of the data:

$ cat col.pl
push @rows, [@F];
END {
    my $delim = "\t";
    my $cols_per_group = 2;
    my $col_start = 0;
    while ( 1 ) {
        for my $row ( @rows ) {
            print join $delim, @{$row}[ $col_start .. ($col_start + $cols_per_group - 1) ];
        }
        $col_start += $cols_per_group;
        last if ($col_start + $cols_per_group - 1) > $#F;
    }
}

Here is an alternate awk that is slower, but at a consistent speed (the number of lines in the file needs to be pre-calculated and passed in as nl):

$ cat col3.awk
function join(start, end,    result, i) {
    for (i=start; i<=end; i++)
        result = result $i (i==end ? ORS : FS)
    return result
}

{   col_offset=0
    for (i=1; i<=NF; i+=cols) {
        col[NR + col_offset*nl] = join(i, i+cols-1)
        col_offset++
        ++cnt
    }
}
END {
    for (i=1; i<=cnt; i++)
        printf "%s", col[i]
}
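The array key NR + col_offset*nl is what sequences the output: column group 0 lands in slots 1..nl, group 1 in slots nl+1..2*nl, and so on, so the END loop can print in order. A small Python replay of that keying (hypothetical helper name; nl and nf match the 8x8 example):

```python
def col3_keys(nl, nf, cols=2):
    """Replay col3.awk's key computation: key = NR + col_offset*nl."""
    keys = {}
    for NR in range(1, nl + 1):              # record number
        col_offset = 0
        for i in range(1, nf + 1, cols):     # first field of each group
            key = NR + col_offset * nl
            keys[key] = (NR, i)              # which row/group lands in this slot
            col_offset += 1
    return keys

keys = col3_keys(nl=8, nf=8)
# slot 1 holds row 1's first group; slot 9 holds row 1's second group
print(keys[1], keys[9])
```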

And Glenn Jackman's awk (col1.awk in the timings below; not to pick on him, since ALL the awks had the same bad result with many rows):

function join(start, end,    result, i) {
    for (i=start; i<=end; i++)
        result = result $i (i==end ? ORS : FS)
    return result
}
{
    c=0
    for (i=1; i<NF; i+=n) {
        c++
        col[c] = col[c] join(i, i+n-1)
    }
}
END {
    for (i=1; i<=c; i++)
        printf "%s", col[i]  # the value already ends with newline
}

Here are the timings with many columns (i.e., in the Perl script that generates file above, my $cols = 2**20 and my $rows = 2**3):

echo 'glenn jackman awk'
time awk -f col1.awk -v n=2 file >file1

echo 'glenn jackman gawk'
time gawk -f col1.awk -v n=2 file >file5 

echo 'perl'
time perl -lan col.pl file >file2

echo 'dawg Python'
time python3 col.py file 2 >file3

echo 'dawg awk'
time awk -f col3.awk -v nl=$(awk '{cnt++} END{print cnt}' file) -v cols=2 file >file4

Prints:

# 2**20 COLUMNS; 2**3 ROWS

glenn jackman awk
real    0m4.460s
user    0m4.344s
sys 0m0.113s

glenn jackman gawk    
real    0m4.493s
user    0m4.379s
sys 0m0.109s

perl    
real    0m3.005s
user    0m2.774s
sys 0m0.230s

dawg Python    
real    0m2.871s
user    0m2.721s
sys 0m0.148s

dawg awk    
real    0m11.356s
user    0m11.038s
sys 0m0.312s

But transpose the shape of the data by setting my $cols = 2**3 and my $rows = 2**20 and run the same timings:

# 2**3 COLUMNS; 2**20 ROWS

glenn jackman awk
real    23m15.798s
user    16m39.675s
sys 6m35.972s

glenn jackman gawk
real    21m49.645s
user    16m4.449s
sys 5m45.036s

perl    
real    0m3.605s
user    0m3.348s
sys 0m0.228s

dawg Python    
real    0m3.157s
user    0m3.065s
sys 0m0.080s

dawg awk    
real    0m11.117s
user    0m10.710s
sys 0m0.399s

So question:

What would cause the first awk to be 100x slower if the data are transposed to millions of rows vs millions of columns?

It is the same number of elements processed and the same total data. The join function is called the same number of times.

Upvotes: 3

Views: 175

Answers (3)

dawg

Reputation: 103754

Ed Morton's approach did fix the speed issue.

Here is the awk I wrote that supports variable columns:

$ cat col.awk
{
    for (i=1; i<=NF; i++) {
        vals[++numVals] = $i
    }
}
END {
    for (col_offset=0; col_offset+cols <= NF; col_offset+=cols) {
        for (i=1; i<=numVals; i+=NF) {
            for (j=0; j<cols; j++) {
                printf "%s%s", vals[i+j+col_offset], (j<cols-1 ? FS : ORS)
            }
        }
    }
}

$ time awk -f col.awk -v cols=2 file >file.cols
real    0m5.810s
user    0m5.468s
sys 0m0.339s

This is about 6 seconds for datasets of sizes 2**20 x 2**3 and 2**3 x 2**20.

But MAN it sure is nice to have strong support for arrays of arrays (such as in Perl or Python...)

Upvotes: 0

Ed Morton

Reputation: 203229

String concatenation saved into a variable is one of the slowest operations in awk (IIRC it's slower than I/O), because the interpreter constantly has to find a new memory location to hold the result of the concatenation. There's more of that happening in the posted awk scripts as the number of rows grows (each accumulated col[] string gains a chunk per row), so it's probably all of the string concatenation in the posted solutions that's causing the slowdown.
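The cost model behind that explanation is easy to check with arithmetic. If each of R rows appends a fixed-size chunk to the same accumulator string, and every concatenation copies the whole string so far, the total characters copied per column group is quadratic in R; buffering the chunks and joining once is linear. A small Python model (a sketch; the 4-character chunk size is an assumption):

```python
def chars_copied_naive(rows, chunk=4):
    """Total characters copied if every `col[c] = col[c] join(...)`
    recopies the accumulated string into a new location."""
    total = 0
    length = 0
    for _ in range(rows):
        length += chunk      # length of the string after this row
        total += length      # copy old contents plus the new chunk
    return total

def chars_copied_buffered(rows, chunk=4):
    """Total characters copied if chunks are collected and joined once."""
    return rows * chunk

# 2**3 rows vs 2**20 rows: same data per column group, but the naive
# cost grows quadratically while the buffered cost grows linearly.
for rows in (2**3, 2**20):
    print(rows, chars_copied_naive(rows), chars_copied_buffered(rows))
```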

Something like this should be fast and shouldn't be dependent on how many fields there are vs how many records:

$ cat tst.awk
{
    for (i=1; i<=NF; i++) {
        vals[++numVals] = $i
    }
}
END {
    for (i=1; i<=numVals; i+=2) {
        valNr = i + ((i-1) * NF)        # <- not correct, fix it!
        print vals[valNr], vals[valNr+1]
    }
}

I don't have time right now to figure out the correct math for the index in the single-loop approach above (see the comment in the code), so here's a working version with two loops. It doesn't require as much thought and shouldn't run much, if any, slower:

$ cat tst.awk
{
    for (i=1; i<=NF; i++) {
        vals[++numVals] = $i
    }
}
END {
    inc = NF - 1
    for (i=0; i<NF; i+=2) {
        for (j=1; j<=NR; j++) {
            valNr = i + j + ((j-1) * inc)
            print vals[valNr], vals[valNr+1]
        }
    }
}
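The index arithmetic in that END block is easy to sanity-check outside awk. This Python sketch replays the same valNr = i + j + (j-1)*(NF-1) walk over the example matrix (variable names mirror the awk; vals is 0-based here, so the 1-based valNr is shifted on lookup):

```python
matrix = """\
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w"""

rows = [line.split() for line in matrix.splitlines()]
NR, NF = len(rows), len(rows[0])
vals = [v for row in rows for v in row]   # awk's vals[], flattened row-major

out = []
inc = NF - 1
for i in range(0, NF, 2):                 # column-group offset
    for j in range(1, NR + 1):            # record number
        valNr = i + j + (j - 1) * inc     # 1-based index, as in the awk
        out.append((vals[valNr - 1], vals[valNr]))

print(out[0], out[-1])
```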

$ awk -f tst.awk file
4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2
6 7
4 2
6 2
7 1
9 0
a 2
3 2
9 8
9 5
4 2
5 s
2 2
5 6
3 4
1 4
4 8
4 g
5 3
3 4
4 1
d f
5 9
q w

Upvotes: 2

James Brown

Reputation: 37394

A play with strings:

$ awk '
{
    a[NR]=$0                                           # hash rows to a
    c[NR]=1                                            # index pointer 
}
END {
    b=4                                                # buffer size for match
    i=NR                                               # row count
    n=length(a[1])                                     # process til the first row is done
    while(c[1]<n)
        for(j=1;j<=i;j++) {                            # of each row
            match(substr(a[j],c[j],b),/([^ ]+ ?){2}/)  # read 2 fields and separators
            print substr(a[j],c[j],RLENGTH)            # output them
            c[j]+=RLENGTH                              # increase index pointer
        }
}' file

b=4 is a buffer size that is optimal for 2 single-digit fields and 2 single-space separators (a b ), as given in the original question; for real-world data, b should be set to something more suitable. Omitting it, which turns the match line into match(substr(a[j],c[j]),/([^ ]+ ?){2}/), kills the performance for data with lots of columns.
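The same technique -- keep each row as one string and advance a per-row pointer past the fields just matched -- can be sketched in Python (hypothetical function name; the regex mirrors the awk's 2-field pattern, and the b buffer trick is dropped since re.Pattern.match scans from the given position anyway):

```python
import re

FIELD_PAIR = re.compile(r"(?:\S+ ?){2}")   # 2 fields plus their separators

def columnize(lines):
    rows = [line.rstrip("\n") for line in lines]
    pos = [0] * len(rows)                  # per-row index pointer, like c[]
    out = []
    while pos[0] < len(rows[0]):           # until the first row is consumed
        for j, row in enumerate(rows):
            m = FIELD_PAIR.match(row, pos[j])  # read 2 fields at the pointer
            out.append(m.group(0).rstrip())    # output them
            pos[j] = m.end()                   # advance the pointer
    return out

print(columnize(["a b c d", "1 2 3 4"]))
```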

I got times around 8 seconds for datasets of sizes 2**20 x 2**3 and 2**3 x 2**20.

Upvotes: 1
