Reputation: 1612
I need to full outer join two lists, parsed from a file, where each list has > 14 million entries. My first thought was using a simple solution using System.Linq
type AccountingData =
{ AccountingId: AccountingId
Year: int
Month: int
Value: decimal }
let combine xs ys =
let withDefaults (xs: AccountingData seq) (ys: AccountingData seq) =
query {
for x in xs do
leftOuterJoin y in ys on (x.AccountingId = y.AccountingId) into g
for y in g.DefaultIfEmpty() do
select (
if obj.ReferenceEquals(null, y) then
x, { x with Value = 0m }
else
x, y
)
}
Enumerable.Union
(withDefaults xs ys,
withDefaults ys xs |> Seq.map Tuple.swap)
let combined = combine xs ys
But this does not work well for such large data sets and it's extremely slow. Maybe this could be rewritten using a tail recursive function? How?
Upvotes: 0
Views: 84
Reputation: 243041
If you need to process data that is too large to be handled conveniently in memory (especially if you need joins), it might be easier to use a database. However, in this case, I think you should be able to do what you need without a database.
The simplest approach that should be quite efficient would be to use a generic Dictionary<K, V>
as a lookup table, populate this with values from ys
and then find a matching value for each x
. Something like:
let withDefaults (xs: AccountingData seq) (ys: AccountingData seq) =
// Create a lookup for finding 'ys' by accounting Id
let ylookup = Dictionary<_, _>()
for y in ys do ylookup.Add(y.AccountingId, y)
seq {
// Find a corresponding 'y' or default for each 'x'
for x in xs do
match ylookup.TryGetValue(x.AccountingId) with
| true, y -> x, y
| false, _ -> x, { x with Value = 0m } }
It is also worth thinking about whether you want to store data as a lazy sequence (i.e. seq<T>
) or whether something else would be better - a lazy sequence has some nice benefits, but it can easily lead to a situation where you are re-running code multiple times by accident.
Upvotes: 2