Daniel
Daniel

Reputation: 1612

How to combine large data sets

I need to full outer join two lists, parsed from a file, where each list has > 14 million entries. My first thought was using a simple solution using System.Linq

type AccountingData =
    { AccountingId: AccountingId
      Year: int
      Month: int
      Value: decimal }

let combine xs ys =
    let withDefaults (xs: AccountingData seq) (ys: AccountingData seq) =
        query {
            for x in xs do
            leftOuterJoin y in ys on (x.AccountingId = y.AccountingId) into g
            for y in g.DefaultIfEmpty() do
            select (
                if obj.ReferenceEquals(null, y) then
                    x, { x with Value = 0m }
                else
                    x, y
            )
        }
    Enumerable.Union
        (withDefaults xs ys,
         withDefaults ys xs |> Seq.map Tuple.swap)

let combined = combine xs ys

But this does not work well for such large data sets and it's extremely slow. Maybe this could be rewritten using a tail recursive function? How?

Upvotes: 0

Views: 84

Answers (1)

Tomas Petricek
Tomas Petricek

Reputation: 243041

If you need to process data that is too large to be handled conveniently in memory (especially if you need joins), it might be easier to use a database. However, in this case, I think you should be able to do what you need without a database.

The simplest approach that should be quite efficient would be to use a generic Dictionary<K, V> as a lookup table, populate this with values from ys and then find a matching value for each x. Something like:

let withDefaults (xs: AccountingData seq) (ys: AccountingData seq) =
  // Create a lookup for finding 'ys' by accounting Id
  let ylookup = Dictionary<_, _>()
  for y in ys do ylookup.Add(y.AccountingId, y)
  seq { 
    // Find a corresponding 'y' or default for each 'x'
    for x in xs do
      match ylookup.TryGetValue(x.AccountingId) with
      | true, y -> x, y
      | false, _ -> x, { x with Value = 0m } }

It is also worth thinking about whether you want to store data as a lazy sequence (i.e. seq<T>) or whether something else would be better - a lazy sequence has some nice benefits, but it can easily lead to a situation where you are re-running code multiple times by accident.

Upvotes: 2

Related Questions