Reputation: 1106
Given a vector of vectors of strings, like:
sentences = [ ["Julia", "is", "1000x", "faster", "than", "Python!"],
              ["Julia", "reads", "beautiful!"],
              ["Python", "has", "600", "times", "more", "libraries"]
            ]
I'm trying to filter out some tokens in each of them, without losing the outer vector structure (i.e., without flattening the vector down to a single list of tokens).
So far I've achieved this using a classic for loop:
number_of_alphabetical_tokens = []
number_of_long_words = []
total_tokens = []
for sent in sentences
    append!(number_of_alphabetical_tokens, length([token for token in sent if all(isletter, token)]))
    append!(number_of_long_words, length([token for token in sent if length(token) > 2]))
    append!(total_tokens, length(sent))
end
collect(zip(number_of_alphabetical_tokens, number_of_long_words, total_tokens))
Output (edited per @shayan's observation):
3-element Vector{Tuple{Any, Any, Any}}:
(4, 5, 6)
(2, 3, 3)
(5, 6, 6)
This gets the job done, but it takes more time than I'd like (I have 6000+ documents, with thousands of sentences each...), and it looks a bit like an antipattern.
Is there a way of doing this with comprehensions or broadcasting (or any more performant method)?
Upvotes: 1
Views: 536
Reputation: 6355
First, I think there is a mistake in your expected results: for example, you wrote 7 for the total number of tokens in the first sentence, while it should actually be 6.
You can follow such a procedure, fully vectorized:
julia> sentences = [ ["Julia", "is", "1000x", "faster", "than", "Python!"],
                     ["Julia", "reads", "beautiful!"],
                     ["Python", "has", "600", "times", "more", "libraries"]
                   ];
julia> function check_all_letter(str::String)
all(isletter, str)
end
check_all_letter (generic function with 1 method)
julia> all_letters = map(x->filter(check_all_letter, x), sentences)
3-element Vector{Vector{String}}:
["Julia", "is", "faster", "than"]
["Julia", "reads"]
["Python", "has", "times", "more", "libraries"]
julia> length.(all_letters)
3-element Vector{Int64}:
4
2
5
I can follow a similar procedure for number_of_long_words and total_tokens. Wrapping all of it in a function, I'll have:
julia> function arbitrary_name(vec::Vector{Vector{String}})
           # Use the argument `vec` rather than the global `sentences`
           all_letters = map(x->filter(check_all_letter, x), vec)
           long_words = map(x->filter(y->length.(y).>2, x), vec)
           total_tokens = length.(vec)
           return collect(zip(length.(all_letters),
                              length.(long_words),
                              total_tokens))
       end
arbitrary_name (generic function with 1 method)
julia> arbitrary_name(sentences)
3-element Vector{Tuple{Int64, Int64, Int64}}:
(4, 5, 6)
(2, 3, 3)
(5, 6, 6)
When I write something like length.(y).>2, I'm in fact chaining Julia functions through broadcasting. Consider this example to see what happens in length.(y).>2:
julia> vec = ["foo", "bar", "baz"];
julia> lengths = length.(vec)
3-element Vector{Int64}:
3
3
3
julia> more_than_two = lengths .> 2
3-element BitVector:
1
1
1
# This is exactly equal to this:
julia> length.(vec).>2
3-element BitVector:
1
1
1
# Or
julia> vec .|> length .|> (x -> x > 2)
3-element BitVector:
1
1
1
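To see why the dotted form also works inside the filter in arbitrary_name, where y is a single String rather than a vector, here is a small sketch (the variable names are mine, for illustration only). As far as I know, broadcasting treats a String as a scalar, so length.(y).>2 on one string gives the same Bool as the plain length(y) > 2:

```julia
word = "foo"

# A String is treated as a scalar by broadcasting, so this yields a plain Bool
dotted = length.(word) .> 2
plain  = length(word) > 2
println(dotted === plain)

# The undotted predicate can therefore be used directly, e.g. with count
sentence = ["Julia", "is", "1000x"]
n_long = count(w -> length(w) > 2, sentence)
println(n_long)
```

So the dots are only needed when the argument is a collection of strings, not a single string.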
I hope this helps, @fandak 🧡. See the official documentation for further explanation of broadcasting and function chaining.
Upvotes: 3
Reputation: 12664
There's no reason to avoid loops for performance reasons in Julia. Loops are fast, and vectorized code is just loops in disguise.
Here's an example of doing this with loops, and some reductions, like all and count:
function wordstats(sentences)
    out = Vector{NTuple{3, Int}}(undef, length(sentences))
    for (i, sent) in pairs(sentences)
        a = count(all(isletter, word) for word in sent)
        b = count(length(word)>2 for word in sent)
        c = length(sent)
        out[i] = (a, b, c)
    end
    return out
end
The above code is not optimized (for example, counting the words longer than 2 could be improved), but it runs in approximately 700ns on my laptop, which is much faster than the vectorized solution.
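As a rough sketch of one way the long-word count might be improved (this is my assumption about what could help, and I have not benchmarked it): length on a String has to walk the whole string, whereas a small helper can stop as soon as it has seen more than n characters. The names longer_than and wordstats_short_circuit are hypothetical, made up for this illustration:

```julia
# Hypothetical helper: true if `word` has more than `n` characters.
# It stops iterating as soon as the answer is known.
function longer_than(word::AbstractString, n::Integer)
    seen = 0
    for _ in word
        seen += 1
        seen > n && return true
    end
    return false
end

# Same statistics as wordstats, using the function-first form of count
function wordstats_short_circuit(sentences)
    map(sentences) do sent
        a = count(word -> all(isletter, word), sent)
        b = count(word -> longer_than(word, 2), sent)
        (a, b, length(sent))
    end
end

sentences = [["Julia", "is", "1000x", "faster", "than", "Python!"],
             ["Julia", "reads", "beautiful!"],
             ["Python", "has", "600", "times", "more", "libraries"]]

stats = wordstats_short_circuit(sentences)
println(stats)
```

Whether this is actually faster will depend on how long the tokens are; for short words the difference is probably negligible.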
Edit: Here's basically the same code, but using the map do syntax (so you don't have to figure out the return type):
function wordstats2(sentences)
    map(sentences) do sent
        a = count(all(isletter, word) for word in sent)
        b = count(length(word)>2 for word in sent)
        c = length(sent)
        return (a, b, c)
    end
end
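Since the question asked about comprehensions and broadcasting: the same per-sentence body also fits both forms. A minimal sketch, where sentence_stats and wordstats3 are names I made up for the illustration:

```julia
# Per-sentence statistics: (alphabetical tokens, tokens longer than 2, total tokens)
sentence_stats(sent) = (count(word -> all(isletter, word), sent),
                        count(word -> length(word) > 2, sent),
                        length(sent))

# Comprehension form
wordstats3(sentences) = [sentence_stats(sent) for sent in sentences]

sentences = [["Julia", "is", "1000x", "faster", "than", "Python!"],
             ["Julia", "reads", "beautiful!"],
             ["Python", "has", "600", "times", "more", "libraries"]]

from_comprehension = wordstats3(sentences)
# Broadcasting form: apply the same function to every sentence
from_broadcast = sentence_stats.(sentences)
println(from_comprehension)
```

Both forms are still loops underneath, so I would not expect either to beat the explicit loop; they are mainly a matter of taste.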
Upvotes: 3