Count distinct values in array - performance tips

Question

I'm having some issues optimizing a go map.
I want to generate a frequency table (count distinct occurrences) in an array of strings. My code holds nicely for small arrays, but as I start working with 100k+ structures -with many distinct values- it just isn't performant enough.

Right now, my approach is to generate an array with the distinct values, compare values and increasing the counter variable (mapped to the string).

    counter := make( map[string]int )    
    for _, distinct := range distinctStrArray{
        for _, row := range StrArray{
            if (row == distinct){
                counter[distinct]++
            }  
        } 
    }

I've tried another approach, where with the input array previously sorted (to minimize the number of changes to the map). This is a bit faster.

    count:=0
    for _, distinct := range distinctStrArray{
        for _, row := range StrArray{
            if (row == distinct){
                count++
            }  
        } 
    counter[distinct] += count
    count= 0
    }

Do you have any suggestion of what I can do to optimize a simple count(distinct) type problem...? I'm open to anything.
thanks!

Adrian · Accepted Answer

Without more context, I would dump the separate array of distinct values - generating it takes time, and using it necessitates the nested loop. Assuming there's no other purpose to the second array, I'd use something like:

counter := make( map[string]int )    
for _, row := range StrArray {
    counter[row]++
}

If you need the list of distinct strings without the counts for some separate purpose, you can easily get it afterward:

distinctStrings := make([]string, len(counter))
i := 0
for k := range counter {
    distinctStrings[i] = k
    i++
}

Iterating the array of distinct strings is O(n), while map access by key is O(log(n)). That takes your overall from O(n^2) to O(n*log(n)), which should be a significant improvement with larger datasets. But, as with any optimization: test, measure, analyze, optimize.

Count distinct values in array - performance tips

Answers (1)

Related Questions