Reputation: 18707
Here's my way of calculating running count by groups in Sheets:
=LAMBDA(a,INDEX(if(a="",,COUNTIFS(a,a,row(a),"<="&row(a)))))(B4:B)
The complexity of this formula is R^2 = 1000000 operations for 1K rows. I'd love to make more efficient formula, and tried combinations of LABMDA
and SCAN
. For now I've found only the way to do it fast with 1 group at a time:
=INDEX(IF(B4:B="π½ Corn",SCAN(0,B4:B,LAMBDA(i,v,if(v="π½ Corn",i+1,i))),))
Can we do the same for all groups? Do you have an idea?
Note: the script solution would use object and hash
to make it fast.
We have a list of N
items total with m
groups. Group m(i)
is a unique item which may repeat randomly. Samlpe dataset:
a
b
b
b
a
β Sample for 5 items total and 2 groups: N=5
; m=2
. Groups are "a" and "b"
The task is to find the function which will work faster for different numbers of N
and m
:
m(i)
m
N
~ 50K+Samlpe Google Sheet with 50K rows of data. Please click on the button 'Use Tamplate':
Tested solutions:
Countifs
from the question and Countif
and from answer.Xlookup
from answerMatch
logic from answerSorting
logic from the answerIn my enviroment, the sorting
option works faster than other provided solutions. Test results are here, tested with the code from here.
Upvotes: 4
Views: 602
Reputation: 18766
Here's an implementation of kishkin's second approach that offloads much of the lookup table setup to lambdas early on. The changes in logic are not that big, but they seem to benefit the formula quite a bit:
5 uniques | 5000 rows | 4000 rows | 3000 rows | 2000 rows | 1000 rows |
---|---|---|---|---|---|
lambda offload | 14.87x | 14.45x | 10.04x | 10.50x | 7.05x |
sort redux | 7.73x | 5.89x | 4.89x | 3.96x | 2.24x |
max makhrov sort | 4.23x | 4.52x | 3.65x | 3.31x | 1.95x |
array countifs | 2.59x | 2.66x | 2.55x | 2.56x | 2.90x |
kishkin2 | 0.83x | 0.80x | 0.81x | 1.03x | 1.19x |
naΓ―ve countif | 1.00x | 1.00x | 1.00x | 1.00x | 1.00x |
I primarily tested using this benchmark and would welcome testing by others.
=arrayformula(
lambda(
groups,
lambda(
uniques, shiftingFactor,
lambda(
shiftedOrdinals,
lambda(
ordinalLookup,
lambda(
groupLookup,
iferror(
match(
vlookup(groups, groupLookup, 2, true) + row(groups),
ordinalLookup,
1
)
-
vlookup(groups, groupLookup, 3, true)
)
)(
sort(
{
uniques,
shiftedOrdinals,
match(shiftedOrdinals, ordinalLookup, 1)
}
)
)
)(
sort(
{
match(groups, uniques, 1) * shiftingFactor + row(groups);
shiftedOrdinals
}
)
)
)(sequence(rows(uniques)) * shiftingFactor)
)(
unique(groups),
10 ^ int(log10(rows(groups)) + 1)
)
)(A2:A)
)
The formula performs best when the number of groups is small. Here are some benchmark results with a simple numeric 50k row corpus where the number of uniques differs:
50k rows | 11 uniques | 1000 uniques |
---|---|---|
lambda offload | 14.41x | 3.57x |
array countifs | 1.00x | 1.00x |
Performance degrades as the number of groups increases, and I even got a few incorrect results when the number of groups approached 20k.
Upvotes: 2
Reputation: 18707
Sorting
algorithmThe idea is to use SORT
in order to reduce the complexity of the calculation. Sorting is the built-in functionality and it works faster than countifs
.
Data is in range A2:A
=SORT({A2:A,SEQUENCE(ROWS(A2:A))})
C2:C
is a range with sorted groups
=MAP(SEQUENCE(ROWS(A2:A)),LAMBDA(v,if(v=1,0,if(INDEX(C2:C,v)<>INDEX(C2:C,v-1),1,0))))
Count the item of each group by the column of 0/1
values, 1 - where group starts:
=SCAN(0,F2:F,LAMBDA(ini,v,IF(v=1,1,ini+1)))
=SORT(H2:H,D2:D,1)
Suggested by Tom Sharpe:
cut out one stage of the calculation by omitting the map and going straight to a scan like this:
=LAMBDA(a,INDEX(if(a="",, LAMBDA(srt, SORT( SCAN(1,SEQUENCE(ROWS(a)), LAMBDA(ini,v,if(v=1,1,if(INDEX(srt,v,1)<>INDEX(srt,v-1,1),1,ini+1)))), index(srt,,2),1) ) (SORT({a,SEQUENCE(ROWS(a))})))))(A2:A)
β In my tests this solution is faster.
I pack it into the named function. Sample file with the solution: https://docs.google.com/spreadsheets/d/1OSnLuCh-duW4eWH3Y6eqrJM8nU1akmjXJsluFFEkw6M/edit#gid=0
this image explains the logic and the speed of sorting:
β read more about the speed test
Upvotes: 3
Reputation: 18707
Transpose
groups m
= 5I've found a possible way for a small amount of counted groups.
In my tests: 20K rows and 5 groups => cumulative count worked faster with this function:
INDEX(if(B4:B="",,LAMBDA(eq,BYROW(index(TRANSPOSE(SPLIT(TRANSPOSE(BYCOL(eq,LAMBDA(c,query("-"&SCAN(0,c,LAMBDA(i,v,i+v)),,2^99))))," -"))*eq),LAMBDA(r,sum(r))))(--(B4:B=TRANSPOSE(UNIQUE(B4:B))))))
It's ugly, but for now I cannot do a better version as bycol
function does not produce arrays.
The perfect solution would be to have "hash"-like function in google-sheets:
/** runningCount
*
* @param {Range} data
*
* @CustomFunction
*
*/
function runningCount(data) {
var obj = {};
var l = data[0].length;
var k;
var res = [], row;
for (var i = 0; i < data.length; i++) {
row = []
for (var ii = 0; ii < l; ii++) {
k = '' + data[i][ii];
if (k === '') {
row.push('');
} else {
if (!(k in obj)) {
obj[k] = 1;
} else {
obj[k]++;
}
row.push(obj[k]);
}
}
res.push(row);
}
return res;
}
Upvotes: 3
Reputation: 5325
Another approach. Works roughly 4 times faster than the first one.
=LAMBDA(
shift,
ref,
big_ref,
LAMBDA(
base_ref,
big_ref,
ARRAYFORMULA(
IF(
A2:A = "",,
MATCH(VLOOKUP(A2:A, base_ref, 2,) + ROW(A2:A), big_ref,) - VLOOKUP(A2:A, base_ref, 3,)
)
)
)
(
ARRAYFORMULA(
{
ref,
SEQUENCE(ROWS(ref)) * shift,
MATCH(SEQUENCE(ROWS(ref)) * shift, big_ref,)
}
),
big_ref
)
)
(
10 ^ INT(LOG10(ROWS(A:A)) + 1),
UNIQUE(A2:A),
SORT(
{
MATCH(A2:A, UNIQUE(A2:A),) * 10 ^ INT(LOG10(ROWS(A:A)) + 1) + ROW(A2:A);
SEQUENCE(ROWS(UNIQUE(A2:A))) * 10 ^ INT(LOG10(ROWS(A:A)) + 1)
}
)
)
Upvotes: 2
Reputation: 5325
You can try this:
=QUERY(
REDUCE(
{"", 0},
B4:B10000,
LAMBDA(
acc,
cur,
{
acc;
cur, XLOOKUP(
cur,
INDEX(acc, 0, 1),
INDEX(acc, 0, 2),
0,
0,
-1
) + 1
}
)
),
"SELECT Col2 OFFSET 1",
0
)
A bit better than R^2. Works fast enough on 10 000 rows. On 100 000 rows it works, but it is quite slow.
Upvotes: 2
Reputation: 10117
Mmm, it will probably be more efficient, but you'll have to try:
=Byrow(B4:B,lambda(each,if(each="","",countif(B4:each,each))))
or
=map(B4:B,lambda(each,if(each="","",countif(B4:each,each))))
Let me know!
Upvotes: 1