Dijkie85

Reputation: 1106

Iterate over vector of vectors of Strings without using for loops in Julia

Given a vector of vectors of strings, like:

sentences = [ ["Julia", "is", "1000x", "faster", "than", "Python!"], 
              ["Julia", "reads", "beautiful!"], 
              ["Python", "has", "600", "times", "more", "libraries"] 
]

I'm trying to filter out some tokens in each of them, without losing the outer vector structure (i.e., without flattening the vector down to a single list of tokens).

So far I've achieved this using a classic for loop:

number_of_alphabetical_tokens = []
number_of_long_words = []
total_tokens = []

for sent in sentences
    append!(number_of_alphabetical_tokens, length([token for token in sent if all(isletter, token)]))
    append!(number_of_long_words, length([token for token in sent if length(token) > 2]))
    append!(total_tokens, length(sent))
end

collect(zip(number_of_alphabetical_tokens, number_of_long_words, total_tokens))

Output (edited as per @shayan's observation):

3-element Vector{Tuple{Any, Any, Any}}:
 (4, 5, 6)
 (2, 3, 3)
 (5, 6, 6)

This gets the job done, but it takes more time than I'd like (I have 6000+ documents, with thousands of sentences each...), and it looks a bit like an antipattern.

Is there a way of doing this with comprehensions or broadcasting (or any more performant method)?

Upvotes: 1

Views: 536

Answers (2)

Shayan

Reputation: 6355

First, I think there are mistakes in the results you posted; for example, you wrote 7 for the total number of tokens in the first sentence, whereas it should be 6.
You can use a fully vectorized procedure like this:

julia> sentences = [ ["Julia", "is", "1000x", "faster", "than", "Python!"],
                     ["Julia", "reads", "beautiful!"],
                     ["Python", "has", "600", "times", "more", "libraries"]
                   ];

julia> function check_all_letter(str::String)
           all(isletter, str)
       end
check_all_letter (generic function with 1 method)

julia> all_letters = map(x->filter(check_all_letter, x), sentences)
3-element Vector{Vector{String}}:
 ["Julia", "is", "faster", "than"]
 ["Julia", "reads"]
 ["Python", "has", "times", "more", "libraries"]

julia> length.(all_letters)
3-element Vector{Int64}:
 4
 2
 5

A similar procedure works for number_of_long_words and total_tokens. Wrapping it all in a function gives:

julia> function arbitrary_name(vec::Vector{Vector{String}})
           all_letters = map(x->filter(check_all_letter, x), vec)
           long_words = map(x->filter(y->length.(y).>2, x), vec)
           total_tokens = length.(vec)

           return collect(zip( length.(all_letters),
                               length.(long_words),
                               total_tokens
                             )
                   )
       end
arbitrary_name (generic function with 1 method)

julia> arbitrary_name(sentences)
3-element Vector{Tuple{Int64, Int64, Int64}}:
 (4, 5, 6)
 (2, 3, 3)
 (5, 6, 6)
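
If only the counts are needed, the intermediate filtered vectors can be skipped. Here is a minimal sketch using a comprehension with count; token_counts is a made-up name, not part of the function above, and it still returns one tuple per sentence:

```julia
sentences = [["Julia", "is", "1000x", "faster", "than", "Python!"],
             ["Julia", "reads", "beautiful!"],
             ["Python", "has", "600", "times", "more", "libraries"]]

# Hypothetical helper: one (alphabetical, long, total) tuple per sentence.
# count(predicate, collection) tallies matches without building a filtered copy.
token_counts(sents) = [(count(t -> all(isletter, t), s),
                        count(t -> length(t) > 2, s),
                        length(s)) for s in sents]

counts = token_counts(sentences)
```

The outer structure is preserved because the comprehension iterates over sentences, not over individual tokens.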

Additional explanation

When I write something like length.(y).>2, I'm chaining Julia functions through broadcasting. The following example shows step by step what happens in an expression like length.(vec).>2:

julia> vec = ["foo", "bar", "baz"];

julia> lengths = length.(vec)
3-element Vector{Int64}:
 3
 3
 3

julia> more_than_two = lengths .> 2
3-element BitVector:
 1
 1
 1

# This is exactly equivalent to:
julia> length.(vec).>2
3-element BitVector:
 1
 1
 1

# Or, with the broadcasted pipe operator:
julia> vec .|> length .|> x->isless(2, x)
3-element BitVector:
 1
 1
 1

I hope this helps, @fandak 🧡. See the official documentation for more on broadcasting and function chaining.
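
One detail worth spelling out: inside filter, y is a single String rather than a vector, and Julia broadcasts a string as a scalar. So, as far as I can tell, length.(y).>2 there yields one Bool, the same as length(y) > 2. A small sketch (variable names are just for illustration):

```julia
word = "Julia"

n = length.(word)              # a string broadcasts as a scalar, so this equals length(word)
is_long = length.(word) .> 2   # a single Bool, not a BitVector

tokens = ["Julia", "is", "1000x"]
broadcast_style = filter(y -> length.(y) .> 2, tokens)  # dotted form, as in the answer
plain_style     = filter(y -> length(y) > 2, tokens)    # undotted form, same result
```

The dots only produce element-wise results when the argument is a collection, such as the vec of strings in the example above.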

Upvotes: 3

DNF

Reputation: 12664

There's no reason to avoid loops for performance reasons in Julia. Loops are fast, and vectorized code is just loops in disguise.

Here's an example of doing this with loops, and some reductions, like all and count:

function wordstats(sentences)
    out = Vector{NTuple{3, Int}}(undef, length(sentences))
    for (i, sent) in pairs(sentences)
        a = count(all(isletter, word) for word in sent)
        b = count(length(word)>2 for word in sent)
        c = length(sent)
        out[i] = (a, b, c)
    end
    return out
end

The above code is not optimized (for example, counting words longer than 2 can be improved), but it runs in approximately 700 ns on my laptop, which is much faster than the vectorized solution.
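
One possible reading of "can be improved" (my assumption, not necessarily what was meant): length on a String walks the whole string to count characters, whereas checking for a third character can stop early. A sketch with a hypothetical helper longer_than:

```julia
# Hypothetical helper, not from the answer: true if `word` has more than `n` characters.
# It inspects at most n + 1 characters instead of counting the whole string.
longer_than(word, n) = iterate(Iterators.drop(word, n)) !== nothing

sent = ["Python", "has", "600", "times", "more", "libraries", "is"]

# Drop-in replacement for count(length(word) > 2 for word in sent)
b = count(word -> longer_than(word, 2), sent)
```

Whether this is actually faster for short tokens like these would need benchmarking; the early exit should matter mainly for long strings.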

Edit: Here's basically the same code, but using the map do syntax (so you don't have to figure out the return type):

function wordstats2(sentences)
    map(sentences) do sent
        a = count(all(isletter, word) for word in sent)
        b = count(length(word)>2 for word in sent)
        c = length(sent)
        return (a, b, c)
    end
end
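
As a quick check on the question's data, the map version can be run and its element type inspected. This sketch repeats wordstats2 so that it is self-contained:

```julia
sentences = [["Julia", "is", "1000x", "faster", "than", "Python!"],
             ["Julia", "reads", "beautiful!"],
             ["Python", "has", "600", "times", "more", "libraries"]]

function wordstats2(sentences)
    map(sentences) do sent
        a = count(all(isletter, word) for word in sent)  # letters-only tokens
        b = count(length(word) > 2 for word in sent)     # tokens longer than 2 characters
        c = length(sent)                                 # all tokens
        return (a, b, c)
    end
end

stats = wordstats2(sentences)
stats_type = typeof(stats)   # map picks the element type from the returned tuples
```

Here map produces a concretely typed vector of integer 3-tuples, unlike the Vector{Tuple{Any, Any, Any}} from the original loop, which started from untyped [] arrays.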

Upvotes: 3
