ndw

Reputation: 513

Julia - Fastest way to filter based on array of values?

df is a DataFrame with 1.2 million rows

valid is an Array of 16000 valid values to filter for

I tried building the filter with an array comprehension, but it is extremely slow because every row triggers a search through the whole valid array.

df[[i in valid for i in df[:match]], :]

What is a faster way to do this? Using where? The 'filter' function?

Upvotes: 5

Views: 1386

Answers (1)

Przemyslaw Szufel

Reputation: 42264

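Membership tests against a Set are hash lookups, so each check takes roughly constant time instead of a scan through all 16000 values. Searching over a set will therefore be quite fast: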

const validset = Set(valid)
filter(x -> x.match in validset, df)
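If you would rather keep the boolean-mask indexing from the question, the same Set can be used there too. This is only a minimal sketch on made-up toy data (the five-row df and the two-element validset are assumptions for illustration); `Ref` keeps the Set from being broadcast over:

```julia
using DataFrames

# Toy stand-ins for the real 1.2 million row frame and 16000 valid values.
df = DataFrame(match = [3, 10, 7, 10, 42])
validset = Set([10, 42])

# Broadcast `in` over the column; Ref(validset) treats the Set as a scalar.
mask = in.(df.match, Ref(validset))

# Index rows with the boolean mask, keeping all columns.
filtered = df[mask, :]
```

Here `filtered` holds only the rows whose `match` value is in `validset`.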

Some performance:

julia> using DataFrames, BenchmarkTools

julia> df = DataFrame(match=rand(1:(10^8), 10^6));

julia> valid = collect(1:1_000_000); validset=Set(valid)

julia> @btime filter((x)-> x.match in $validset,$df)
  173.341 ms (3999506 allocations: 61.30 MiB)

Or the faster version recommended by Bogumil, which passes a column name and a predicate instead of a function that takes a whole row:

julia> @btime filter(:match => in($validset),$df)
  37.500 ms (23 allocations: 282.44 KiB)
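In case the `in($validset)` part looks odd: calling `in` with a single argument returns a predicate that checks membership in that collection, and `filter(:match => pred, df)` applies it to each value of the `match` column. A small sketch with plain Base Julia, where `candidates` and the Set contents are made up for illustration:

```julia
# Hypothetical toy values, not taken from the question.
validset = Set([5, 7, 11])

# One-argument `in` builds a predicate equivalent to x -> x in validset.
is_valid = in(validset)

candidates = [1, 5, 7, 9]

# Keep only the elements for which the predicate returns true.
kept = filter(is_valid, candidates)
```

`kept` contains the elements of `candidates` that are in `validset`, in their original order.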

Upvotes: 5
