Performance of looping over array of dicts in Julia

Question

I am doing loops over arrays/vector of dicts. My Julia performance is slower than Python. Run time is 25% longer than Python.

The toy program below represents the structure of what I am doing.

const poems = [
    Dict(
        :text => "Once upon a midnight dreary, while I pondered, weak and weary, Over many a quaint and curious volume of forgotten lore—",
        :author => "Edgar Allen Poe"),
    Dict(
        :text => "Because I could not stop for Death – He kindly stopped for me – The Carriage held but just Ourselves – And Immortality.",
        :author => "Emily Dickinson"),
    # etc... 10,000 more    
]

const remove_count = [
    Dict(:words => "the", :count => 0),
    Dict(:words => "and", :count => 0),
    # etc... 100 more
]

function loop_poems(poems, remove_count)
    for p in poems
        for r in remove_count
            if occursin(r[:words], p[:text])
                r[:count] += 1
            end
        end
    end
end

How do I optimize? I have read through Performance Tips in Julia website:

First, I declare constants.

Second, I assume since I pass arguments remove count and poems into the function, I don't need to declare as global.

Third, the meat of the processing (the loops) are in a function.

Fourth... I don't know how to declare types in an array of dicts (specifically for performance). How to do it? Would it help performance?

Simon Schoelly · Accepted Answer

The issue here seems to be, what we call "type instability". Julia code is fast, when Julia can figure out the correct types at runtime, but is slower when the types are not known. To figure out, if there is any kind of type instability, you can use the @code_warntype macro in your REPL:

julia> @code_warntype loop_poems(poems, remove_count)

StackOverflow does not show the colors of the output, but you should look out for red parts, that indicate that Julia cannot narrow down the types enough. Yellow parts also indicate places, where the type is not known exactly, but these parts are often intentionally, so we have to worry less about them.

In my case (Julia v1.8.5) the following lines have some red color

│   %17 = Base.getindex(r, :words)::Any
│   %18 = Base.getindex(p, :text)::String
│   %19 = Main.occursin(%17, %18)::Bool
└──       goto #5 if not %19
4 ─ %21 = Base.getindex(r, :count)::Any
│   %22 = (%21 + 1)::Any

the suffixes ::Any indicate that Julia could only infer the type Any here, which could be any type.

We also see that this happens in the cases of Base.getindex(r, :words) and Base.getindex(r, :count) - these are just the de-sugared expressions r[:words] and r[:count].

So why is that the case? If we look at the type of remove_count with

julia> typeof(remove_count)
Vector{Dict{Symbol, Any}}

We see that that the key type of the dictionary can only be a Symbol - but the value type can be any kind of type. We can get a very moderate speed up by constructing remove_count so that we narrow down the value type to a union:

const remove_count = Dict{Symbol, Union{String, Int}[
    Dict(:words => "the", :count => 0),
    Dict(:words => "and", :count => 0),
    # etc... 100 more
]

Running @code_warntype again, shows, that we still have some red entries, but this time at least they are of type Union{String, Int} - but the speed up is still disappointing.

As others have pointed out, it might be better, if you find a different data structure so that your code is type stable.

There are multiple ways to do that. Probably the easiest is to use a vector of NamedTuple:

const remove_count = [
    (word="the", count=0),
    (word="and", count=0),
    # etc... 100 more
)

so that

typeof(remove_count)
Vector{NamedTuple{(:word, :count), Tuple{String, Int64}}}

and your function then becomes

function loop_poems(poems, remove_count)
    for p in poems
        for i in eachindex(remove_count)
            word = remove_count[i].word
            count = remove_count[i].count
            if occursin(word, text)
                # NamedTuple is immutable, so wee need to create a new one
                remove_count[i] = (word=word, count=count + 1)     
            end
        end
    end
end

If we use @code_warntype again, the red parts have disappeared.

There are few other easy improvements:

Use the @inbounds macro when looping over arrays: https://docs.julialang.org/en/v1/devdocs/boundscheck/#Eliding-bounds-checks
Move p[:text] into the outer loop

and your function then becomes:

function loop_poems(poems, remove_count)
    @inbounds for p in poems
        text = p[:text]
        for i in eachindex(remove_count)
            word = remove_count[i].word
            count = remove_count[i].count
            if occursin(word, text)
                # NamedTuple is immutable, so wee need to create a new one
                remove_count[i] = (word=word, count=count + 1)
            end
        end
    end
end

It also might make sense to also convert poems into a vector of NamedTuple.

Ultimately, if you still need more performance, you might better look at your domain and at more complex string algorithms:

Can your words contain any white spaces? If not, maybe split the poems into tokens.
If you have a lot of words, your words might share a lot of prefixes - in such a case a trie might be of help: https://en.wikipedia.org/wiki/Trie

Performance of looping over array of dicts in Julia

Answers (2)

Related Questions