Reputation: 103754
While looking into this this question the challenge was to take this matrix:
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w
And turn into:
4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2 # top of next 2 columns...
6 7
4 2
... each N elements from each row of the matrix -- in this example, N=2...
3 4
4 1
d f
5 9
q w # last element is lower left of matrix
The OP stated the input was 'much bigger' than the example without specifying the shape of the actual input (millions of rows? millions of columns? or both?)
I assumed (mistakenly) that the file had millions of rows (it was later specified to have millions of columns)
BUT the interesting thing is that most of the awks
written were perfectly acceptable speed IF the shape of the data was millions of columns.
Example: @glennjackman posted a perfectly useable awk so long as the long end was in columns, not in rows.
Here, you can use his Perl to generate an example matrix of rows X columns. Here is that Perl:
perl -E '
my $cols = 2**20; # 1,048,576 columns - the long end
my $rows = 2**3; # 8 rows
my @alphabet=( 'a'..'z', 0..9 );
my $size = scalar @alphabet;
for ($r=1; $r <= $rows; $r++) {
for ($c = 1; $c <= $cols; $c++) {
my $idx = int rand $size;
printf "%s ", $alphabet[$idx];
}
printf "\n";
}' >file
Here are some candidate scripts that turn file
(from that Perl script) into the output of 2 columns taken from the front of each row:
This is the speed champ regardless of the shape of input in Python:
$ cat col.py
import sys
cols=int(sys.argv[2])
offset=0
delim="\t"
with open(sys.argv[1], "r") as f:
dat=[line.split() for line in f]
while offset<=len(dat[0])-cols:
for sl in dat:
print(delim.join(sl[offset:offset+cols]))
offset+=cols
Here is a Perl that is also quick enough regardless of the shape of the data:
$ cat col.pl
push @rows, [@F];
END {
my $delim = "\t";
my $cols_per_group = 2;
my $col_start = 0;
while ( 1 ) {
for my $row ( @rows ) {
print join $delim, @{$row}[ $col_start .. ($col_start + $cols_per_group - 1) ];
}
$col_start += $cols_per_group;
last if ($col_start + $cols_per_group - 1) > $#F;
}
}
Here is an alternate awk
that is slower but a consistent speed (and the number of lines in the file needs to be pre-calculated):
$ cat col3.awk
function join(start, end, result, i) {
for (i=start; i<=end; i++)
result = result $i (i==end ? ORS : FS)
return result
}
{ col_offset=0
for(i=1;i<=NF;i+=cols) {
s=join(i,i+cols-1)
col[NR+col_offset*nl]=join(i,i+cols-1)
col_offset++
++cnt
}
}
END { for(i=1;i<=cnt;i++) printf "%s", col[i]
}
And Glenn Jackman's awk
(not to pick on him since ALL the awks had the same bad result with many rows):
function join(start, end, result, i) {
for (i=start; i<=end; i++)
result = result $i (i==end ? ORS : FS)
return result
}
{
c=0
for (i=1; i<NF; i+=n) {
c++
col[c] = col[c] join(i, i+n-1)
}
}
END {
for (i=1; i<=c; i++)
printf "%s", col[i] # the value already ends with newline
}
Here are the timings with many columns (ie, in the Perl scrip that generates file
above, my $cols = 2**20
and my $rows = 2**3
):
echo 'glenn jackman awk'
time awk -f col1.awk -v n=2 file >file1
echo 'glenn jackman gawk'
time gawk -f col1.awk -v n=2 file >file5
echo 'perl'
time perl -lan columnize.pl file >file2
echo 'dawg Python'
time python3 col.py file 2 >file3
echo 'dawg awk'
time awk -f col3.awk -v nl=$(awk '{cnt++} END{print cnt}' file) -v cols=2 file >file4
Prints:
# 2**20 COLUMNS; 2**3 ROWS
glenn jackman awk
real 0m4.460s
user 0m4.344s
sys 0m0.113s
glenn jackman gawk
real 0m4.493s
user 0m4.379s
sys 0m0.109s
perl
real 0m3.005s
user 0m2.774s
sys 0m0.230s
dawg Python
real 0m2.871s
user 0m2.721s
sys 0m0.148s
dawg awk
real 0m11.356s
user 0m11.038s
sys 0m0.312s
But transpose the shape of the data by setting my $cols = 2**3
and my $rows = 2**20
and run the same timings:
# 2**3 COLUMNS; 2**20 ROWS
glenn jackman awk
real 23m15.798s
user 16m39.675s
sys 6m35.972s
glenn jackman gawk
real 21m49.645s
user 16m4.449s
sys 5m45.036s
perl
real 0m3.605s
user 0m3.348s
sys 0m0.228s
dawg Python
real 0m3.157s
user 0m3.065s
sys 0m0.080s
dawg awk
real 0m11.117s
user 0m10.710s
sys 0m0.399s
So question:
What would cause the first awk to be 100x slower if the data are transposed to millions of rows vs millions of columns?
It is the same number of elements processed and the same total data. The join
function is called the same number of times.
Upvotes: 3
Views: 175
Reputation: 103754
Ed Morton's approach did fix the speed issue.
Here is the awk
I wrote that supports variable columns:
$ cat col.awk
{
for (i=1; i<=NF; i++) {
vals[++numVals] = $i
}
}
END {
for(col_offset=0; col_offset + cols <= NF; col_offset+=cols) {
for (i=1; i<=numVals; i+=NF) {
for(j=0; j<cols; j++) {
printf "%s%s", vals[i+j+col_offset], (j<cols-1 ? FS : ORS)
}
}
}
}
$ time awk -f col.awk -v cols=2 file >file.cols
real 0m5.810s
user 0m5.468s
sys 0m0.339s
This is about 6 seconds for datasets of sizes 220 x 23 and 23 x 220
But MAN it sure is nice to have strong support of arrays of arrays (such as in Perl or Python...)
Upvotes: 0
Reputation: 203229
String concatenation being saved in a variable is one of the slowest operations in awk (IIRC it's slower than I/O) as you're constantly having to find a new memory location to hold the result of the concatenation and there's more of that happening in the awk scripts as the rows get longer so it's probably all of the string concatenation in the posted solutions that's causing the slowdown.
Something like this should be fast and shouldn't be dependent on how many fields there are vs how many records:
$ cat tst.awk
{
for (i=1; i<=NF; i++) {
vals[++numVals] = $i
}
}
END {
for (i=1; i<=numVals; i+=2) {
valNr = i + ((i-1) * NF) # <- not correct, fix it!
print vals[valNr], vals[valNr+1]
}
}
I don't have time right now to figure out the correct math to calculate the index for the single loop approach above (see the comment in the code) so here's a working version with 2 loops that doesn't require as much thought and shouldn't run much if any, slower:
$ cat tst.awk
{
for (i=1; i<=NF; i++) {
vals[++numVals] = $i
}
}
END {
inc = NF - 1
for (i=0; i<NF; i+=2) {
for (j=1; j<=NR; j++) {
valNr = i + j + ((j-1) * inc)
print vals[valNr], vals[valNr+1]
}
}
}
$ awk -f tst.awk file
4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2
6 7
4 2
6 2
7 1
9 0
a 2
3 2
9 8
9 5
4 2
5 s
2 2
5 6
3 4
1 4
4 8
4 g
5 3
3 4
4 1
d f
5 9
q w
Upvotes: 2
Reputation: 37394
A play with strings:
$ awk '
{
a[NR]=$0 # hash rows to a
c[NR]=1 # index pointer
}
END {
b=4 # buffer size for match
i=NR # row count
n=length(a[1]) # process til the first row is done
while(c[1]<n)
for(j=1;j<=i;j++) { # of each row
match(substr(a[j],c[j],b),/([^ ]+ ?){2}/) # read 2 fields and separators
print substr(a[j],c[j],RLENGTH) # output them
c[j]+=RLENGTH # increase index pointer
}
}' file
b=4
is a buffer that is optimal for 2 single digit fields and 2 single space separators (a b
) as was given in the original question, but if the data is real world data, b
should be set to something more suitable. If omitted leading the match
line to match(substr(a[j],c[j]),/([^ ]+ ?){2}/)
kills the performance for data with lots of columns.
I got times around 8 seconds for datasets of sizes 220 x 23 and 23 x 220.
Upvotes: 1