Iterate over vector of vectors of Strings without using for loops in Julia

Question

Given a vector of vectors of strings, like:

sentences = [ ["Julia", "is", "1000x", "faster", "than", "Python!"], 
              ["Julia", "reads", "beautiful!"], 
              ["Python", "has", "600", "times", "more", "libraries"] 
]

I'm trying to filter out some tokens in each of them, without losing the outer vector structure (i.e., without flattening the vector down to a single list of tokens).

So far I've achieved this using a classic for loop:

number_of_alphabetical_tokens = []
number_of_long_tokens = []
total_tokens = []

for sent in sentences
    push!(number_of_alphabetical_tokens, length([token for token in sent if all(isletter, token)]))
    push!(number_of_long_tokens, length([token for token in sent if length(token) > 2]))
    push!(total_tokens, length(sent))
end

collect(zip(number_of_alphabetical_tokens, number_of_long_tokens, total_tokens))

Output (edited as per @shayan's observation):

3-element Vector{Tuple{Any, Any, Any}}:
 (4, 5, 6)
 (2, 3, 3)
 (5, 6, 6)

This gets the job done, but it takes more time than I'd like (I have 6000+ documents, with thousands of sentences each...), and it looks a bit like an antipattern.

Is there a way of doing this with comprehensions or broadcasting (or any more performant method)?

DNF · Accepted Answer

There's no reason to avoid loops for performance reasons in Julia. Loops are fast, and vectorized code is just loops in disguise.
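To make the "loops in disguise" point concrete, here is a minimal sketch comparing a broadcast call with a hand-written loop that does the same work. The name totals_loop is made up for this illustration, and the data is the example from the question.

```julia
# Sample data from the question.
sentences = [["Julia", "is", "1000x", "faster", "than", "Python!"],
             ["Julia", "reads", "beautiful!"],
             ["Python", "has", "600", "times", "more", "libraries"]]

# Broadcast ("vectorized") form: apply `length` to every inner vector.
totals_broadcast = length.(sentences)

# Explicit loop doing the same work, written out by hand.
# `totals_loop` is a hypothetical name used only for this sketch.
function totals_loop(sentences)
    out = Vector{Int}(undef, length(sentences))
    for (i, sent) in pairs(sentences)
        out[i] = length(sent)
    end
    return out
end

totals_loop(sentences) == totals_broadcast  # true, both are [6, 3, 6]
```

Both forms visit each inner vector once, so choosing between them is mostly a matter of readability.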

Here's an example of doing this with loops, and some reductions, like all and count:

function wordstats(sentences)
    out = Vector{NTuple{3, Int}}(undef, length(sentences))
    for (i, sent) in pairs(sentences)
        a = count(all(isletter, word) for word in sent)
        b = count(length(word)>2 for word in sent)
        c = length(sent)
        out[i] = (a, b, c)
    end
    return out
end

The above code is not optimized (for example, counting words longer than 2 could be improved), but it runs in approximately 700 ns on my laptop, which is much faster than the vectorized solution.
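One possible reading of that improvement, as a sketch: length on a String has to walk the whole string, so a check that stops as soon as a third character is seen does less work on long words. The helper longerthan below is hypothetical (it is not part of Base or of the code above), and whether it is measurably faster on real data is an assumption that would need benchmarking.

```julia
# Hypothetical helper: returns true once more than `n` characters have
# been seen, without scanning the rest of the word.
function longerthan(word::AbstractString, n::Integer)
    seen = 0
    for _ in word
        seen += 1
        seen > n && return true
    end
    return false
end

sent = ["Julia", "is", "1000x", "faster", "than", "Python!"]

# Predicate form of `count`: no intermediate array is built.
n_early = count(w -> longerthan(w, 2), sent)
n_plain = count(w -> length(w) > 2, sent)

n_early == n_plain  # true, both are 5
```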

Edit: Here's basically the same code, but using the map do syntax (so you don't have to figure out the return type):

function wordstats2(sentences)
    map(sentences) do sent
        a = count(all(isletter, word) for word in sent)
        b = count(length(word)>2 for word in sent)
        c = length(sent)
        return (a, b, c)
    end
end
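As a quick check, here is the map version applied to the example data from the question. The function is repeated so the snippet runs on its own; the variable name stats is just for illustration.

```julia
# Sample data from the question.
sentences = [["Julia", "is", "1000x", "faster", "than", "Python!"],
             ["Julia", "reads", "beautiful!"],
             ["Python", "has", "600", "times", "more", "libraries"]]

# Same logic as `wordstats2` above, repeated for a self-contained example.
function wordstats2(sentences)
    map(sentences) do sent
        a = count(all(isletter, word) for word in sent)  # purely alphabetical tokens
        b = count(length(word) > 2 for word in sent)     # tokens longer than 2 characters
        c = length(sent)                                 # all tokens
        return (a, b, c)
    end
end

stats = wordstats2(sentences)

stats == [(4, 5, 6), (2, 3, 3), (5, 6, 6)]  # true, matches the output in the question
eltype(stats) == NTuple{3, Int}             # true, a concrete type rather than Tuple{Any, Any, Any}
```

The concrete element type comes from map working out the return type itself, whereas the untyped [] containers in the question produce Any.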
