Reputation: 421
I have used the solution suggested by Tomas Petricek in this question: Count unique in a Deedle Series
I have done a quick test python python and the solution above. I have slightly modified the function suggested by Tomas to sort the counts in the reverse order to match the output on the Series.count_values() from the Python function.
let unique s =
s |> Series.groupInto (fun _ v -> v) (fun _ g -> Stats.count g)
|> Series.sortBy (fun v -> -v)
When I execute the following code in F# Interactive
let rand = Random()
let l = [|1..1_000_000|] |> Array.map (fun _ -> rand.Next(1000))
|> Series.ofValues
|> unique
And using "#time" I get execution of around 1500ms in average (150ms for just creating the random Series).
I have also tested a similar code (in the Python console of PyCharm using Python 3.7)
import time
import pandas
start = time.time()
df = pandas.DataFrame(np.random.randint(0,1000,size=(1000000, 1)), columns=list('A'))
a = df['A'].value_counts()
print(a)
print("ms: %", 1000*(time.time()-start))
And I get, for the creation of the DataFrame + value_counts() around 40ms (half half for each step).
Any tip on how, at least, fasten the creation of the Series in F#? My code may not be the most efficient and I would like to know what I could do. I am trying to switch the mood in my team to switch some research from Python to F# and I do not want to hear from them that F# is way to slow. Thank you all!
Upvotes: 3
Views: 149
Reputation: 11577
I know neither pandas nor deedle but I suspect the generic aggregation in Deedle don't keep up with a potentially specialized version in pandas (or pandas might just be optimized more thoroughly).
One approach is to perform the counting of values before passing the observations to Deedle.
For example:
let rand = Random ()
let variant () =
Array.init 1_000_000 (fun _ -> rand.Next(1000))
|> Array.groupBy id
|> Array.map (fun (k, vs) -> (k, vs.Length))
|> Array.sortBy (fun (_, c) -> -c)
|> Series.ofObservations
When comparing the original code with variant above it I got the following numbers.
Original took: 1197 ms with (60, 30, 11) cc
Variant took: 56 ms with (2, 0, 0) cc
So the array variant seemes significantly faster as well as creating less GC pressure.
Full code sample
open System
open System.Diagnostics
open System.Linq
open Deedle
let now =
let sw = Stopwatch ()
sw.Start ()
fun () -> sw.ElapsedMilliseconds
let time a =
let inline cc i = GC.CollectionCount i
GC.Collect (2, GCCollectionMode.Forced)
GC.WaitForFullGCComplete () |> ignore
let before = now ()
let bcc0, bcc1, bcc2 = cc 0, cc 1, cc 2
let v = a ()
let acc0, acc1, acc2 = cc 0, cc 1, cc 2
let after = now ()
v, after - before, (acc0 - bcc0, acc1 - bcc1, acc2 - bcc2)
let seed = 982301576
let run () =
let rand = Random seed
let original () =
[|1..1_000_000|]
|> Array.map (fun _ -> rand.Next(1000))
|> Series.ofValues
|> Series.groupInto (fun _ v -> v) (fun _ g -> Stats.count g)
|> Series.sortBy (fun v -> -v)
let _, ms, cc = time original
printfn "Original took: %d ms with %A cc" ms cc
let rand = Random seed
let variant () =
Array.init 1_000_000 (fun _ -> rand.Next(1000))
|> Array.groupBy id
|> Array.map (fun (k, vs) -> (k, vs.Length))
|> Array.sortBy (fun (_, c) -> -c)
|> Series.ofObservations
let _, ms, cc = time variant
printfn "Variant took: %d ms with %A cc" ms cc
[<EntryPoint>]
let main argv =
run ()
0
Upvotes: 3