F# - fsc.exe hangs up on huge file

Question

I run some organic chemistry models. A model is described by a generated ModelData.fs file, e.g.: https://github.com/kkkmail/ClmFSharp/blob/master/Clm/Model/ModelData.fs . The file has a very simple structure and using a generated model file is the only way that it can possibly work.

The referenced file is just for tests, but the real models are huge and may go close to 60 - 70 MB / 1.5M LOC. When I try to compile such files, F# compiler,fsc.exe, just hangs up and never comes back. It "eats" about 1.5 GB of memory and then does something forever at near 100% processing capacity. It can clearly handle smaller models, which take about 10 MB in under about a minute. So somewhere between 10 MB and 70 MB something breaks down badly in fsc.

I wonder if there are some parameter tweaks that I could make to the way the fsc compiles the project in order to make it capable of handling such huge models.

The huge models that I am referring to have one parameter set as follows: let numberOfSubstances = 65643. This results in various generated arrays of that size. I wonder if this could be the source of the problem.

Thanks a lot!

Fyodor Soikin · Accepted Answer

I don't think you need to autogenerate all of that.

From your comments, I understand that the functions d0, d1, ... are generated from a big sparse matrix in a way that sums up all of the input array x (with coefficients), but crucially skips summing up zero coefficients, which gives you a great performance gain, because the matrix is huge. Would that be a correct assessment?

If so, I still don't think you need to generate code to do that.

Let's take a look. I will assume that your giant sparse matrix has an interface for obtaining cell values, and it looks something like this:

let getMatrixCell (i: int) (j: int) : double
let maxI: int
let maxJ: int

Then your autogeneration code might look something like this:

let generateDFunction (i: int) =
    printfn "let d%d (x: double[]) =" i
    printfn "    [|"
    for j in 0..maxJ do
        let cell = getMatrixCell i j
        if cell <> 0 then
            printfn "        %f * x.[%d]" cell j
    printfn "    |]"
    printfn "    |> Array.sum"

Which would result in something like this:

let d25 (x : array) = 
    [|
        -1.0 * x.[25]
        1.0 * x.[3]
    |]
    |> Array.sum

Note that I am simplifying here: in your example file, it looks like the functions also multiply negative coefficients by x.[i]. But maybe I'm also overcomplicating, because it looks like all the coefficients are always either 1 or -1. But that is all nonessential to my point.

Now, in the comments, it has been proposed that you don't generate functions d0, d1, ... but instead work directly with the matrix. For example, this would be a naive implementation of such suggestion:

let calculateDFunction (i: int) (x: double[]) =
    [| for j in 0..maxJ -> (getMatrixCell i j) * x.[j] |] |> Array.sum

You then argued that this solution would be prohibitively slow, because it always iterates over the whole array x, which is huge, but most of the coefficients are zero, so it doesn't have to.

And then your way of solving this issue was to use an intermediate step of generated code: you generate the functions that only touch non-zero indicies, and then you compile and use those functions.

But here's the point: yes, you do need that intermediate step to get rid of non-zero indicies, but it doesn't have to be generated-and-compiled code!

Instead, you can prepare lists/arrays of non-zero indicies ahead of time:

let indicies = 
    [| for i in 0..maxI ->
        [ for j in 0..maxJ do
            let cell = getMatrixCell i j
            if cell <> 0 then yield (j, cell)
        ]
    |]

This will yield an array indicies : Array, where each index k corresponds to your autogenerated function dk, and it contains a list of non-zero matrix indicies together with their values in the matrix. For example, the function d22 I gave above would be represented by the 22nd element of indicies:

indicies.[22] = [ (25, -1.0), (3, 1.0) ]

Based on this intermediate structure, you can then calculate any function dk:

let calculateDFunction (k: int) (x: double[]) =
    [| for (j, coeff) in indicies.[k] -> coeff * x.[j] |] |> Array.sum

In fact, if performance is crucial to you (as it seems to be from the comments), you probably should do away with all those intermediate arrays: hundreds or thousands heap allocations on each iteration is definitely not helping. You can sum with a mutable variable instead:

let calculateDFunction (k: int) (x: double[]) =
    let sum = 0.0
    for (j, coeff) in indicies.[k] do
        sum <- sum + coeff * x.[j]
    sum

F# - fsc.exe hangs up on huge file

Answers (1)

Related Questions