Garovann
Garovann

Reputation: 43

Optimize looping over a large string to reduce allocations

I am trying to loop over a string in Julia to parse it. I have a DefaultDict inside a struct, containing the number of times I have seen a particular character.

@with_kw mutable struct Metrics
    ...
    nucleotides = DefaultDict{Char, Int64}(0)
    ...
end

I have written a function to loop over a string and increment the value of each character in the DefaultDict.

function compute_base_composition(sequence::String, metrics::Metrics)
    for i in 1:sizeof(sequence)
        metrics.nucleotides[sequence[i]] += 1
    end
end

This function is called in a for loop because I need to do this for multiple strings (which can be up to 2 billions characters long). When I run the @time macro, I get this result:

@time compute_base_composition(sequence, metrics)
  0.167172 seconds (606.20 k allocations: 15.559 MiB, 78.00% compilation time)
  0.099403 seconds (1.63 M allocations: 24.816 MiB)
  0.032346 seconds (633.24 k allocations: 9.663 MiB)
  0.171382 seconds (3.06 M allocations: 46.751 MiB, 4.64% gc time)

As you can see, there are a lot of memory allocations for such a simple function. I have tried to change the for loop to something like for c in sequence but that didn't change much. Would there be a way to reduce them and make the function faster?

Upvotes: 1

Views: 111

Answers (1)

Przemyslaw Szufel
Przemyslaw Szufel

Reputation: 42264

  1. Work on bytes no on unicode chars
  2. Use Vectors not Dicts
  3. Avoid untyped fields in containers
@with_kw struct MetricsB
    nucleotides::Vector{Int}=zeros(Int, 256)
end

function compute_base_composition(sequence::String, metrics::MetricsB)
    bs = Vector{UInt8}(sequence)
    for i in 1:length(bs)
        @inbounds metrics.nucleotides[bs[i]] += 1
    end
end

And a benchmark with a nice speedup of 90x :

julia> st = randstring(10_000_000);

julia> @time compute_base_composition(st, Metrics())
  1.793991 seconds (19.94 M allocations: 304.213 MiB, 3.33% gc time)

julia> @time compute_base_composition(st, MetricsB())
  0.019398 seconds (3 allocations: 9.539 MiB)

Actually you can almost totally avoid allocations with the following code:

function compute_base_composition2(sequence::String, metrics::MetricsB)
    pp = pointer(sequence)
    for i in 1:length(sequence)
        @inbounds metrics.nucleotides[Base.pointerref(pp, i, 1)] += 1
    end
end

and now:

julia> @time compute_base_composition2(st, MetricsB())
  0.021161 seconds (1 allocation: 2.125 KiB)

Upvotes: 1

Related Questions