Jeff_hk
Jeff_hk

Reputation: 421

Speed issue: Creating Series with Deedle/Getting unique values in F#

I have used the solution suggested by Tomas Petricek in this question: Count unique in a Deedle Series

I have done a quick test python python and the solution above. I have slightly modified the function suggested by Tomas to sort the counts in the reverse order to match the output on the Series.count_values() from the Python function.

let unique s = 
    s |> Series.groupInto (fun _ v -> v) (fun _ g -> Stats.count g)
      |> Series.sortBy (fun v -> -v)

When I execute the following code in F# Interactive

let rand = Random()
let l = [|1..1_000_000|] |> Array.map (fun _ -> rand.Next(1000))
                         |> Series.ofValues
                         |> unique

And using "#time" I get execution of around 1500ms in average (150ms for just creating the random Series).

I have also tested a similar code (in the Python console of PyCharm using Python 3.7)

import time
import pandas
start = time.time()
df = pandas.DataFrame(np.random.randint(0,1000,size=(1000000, 1)), columns=list('A'))
a = df['A'].value_counts()
print(a)
print("ms: %", 1000*(time.time()-start))

And I get, for the creation of the DataFrame + value_counts() around 40ms (half half for each step).

Any tip on how, at least, fasten the creation of the Series in F#? My code may not be the most efficient and I would like to know what I could do. I am trying to switch the mood in my team to switch some research from Python to F# and I do not want to hear from them that F# is way to slow. Thank you all!

Upvotes: 3

Views: 149

Answers (1)

I know neither pandas nor deedle but I suspect the generic aggregation in Deedle don't keep up with a potentially specialized version in pandas (or pandas might just be optimized more thoroughly).

One approach is to perform the counting of values before passing the observations to Deedle.

For example:

let rand = Random ()
let variant () = 
  Array.init 1_000_000 (fun _ -> rand.Next(1000))
  |> Array.groupBy id
  |> Array.map (fun (k, vs) -> (k, vs.Length))
  |> Array.sortBy (fun (_, c) -> -c)
  |> Series.ofObservations

When comparing the original code with variant above it I got the following numbers.

Original took: 1197 ms with (60, 30, 11) cc
Variant took: 56 ms with (2, 0, 0) cc

So the array variant seemes significantly faster as well as creating less GC pressure.

Full code sample

open System
open System.Diagnostics
open System.Linq

open Deedle

let now =
  let sw = Stopwatch ()
  sw.Start ()
  fun () -> sw.ElapsedMilliseconds

let time a =
  let inline cc i       = GC.CollectionCount i
  GC.Collect (2, GCCollectionMode.Forced)
  GC.WaitForFullGCComplete () |> ignore
  let before            = now ()
  let bcc0, bcc1, bcc2  = cc 0, cc 1, cc 2
  let v                 = a ()
  let acc0, acc1, acc2  = cc 0, cc 1, cc 2
  let after             = now ()
  v, after - before, (acc0 - bcc0, acc1 - bcc1, acc2 - bcc2)

let seed = 982301576

let run () =
  let rand = Random seed
  let original () = 
    [|1..1_000_000|] 
    |> Array.map (fun _ -> rand.Next(1000))
    |> Series.ofValues
    |> Series.groupInto (fun _ v -> v) (fun _ g -> Stats.count g)
    |> Series.sortBy (fun v -> -v)

  let _, ms, cc = time original
  printfn "Original took: %d ms with %A cc" ms cc

  let rand = Random seed
  let variant () = 
    Array.init 1_000_000 (fun _ -> rand.Next(1000))
    |> Array.groupBy id
    |> Array.map (fun (k, vs) -> (k, vs.Length))
    |> Array.sortBy (fun (_, c) -> -c)
    |> Series.ofObservations

  let _, ms, cc = time variant
  printfn "Variant took: %d ms with %A cc" ms cc

[<EntryPoint>]
let main argv =
  run ()
  0

Upvotes: 3

Related Questions