Maik Klein
Maik Klein

Reputation: 16148

Avoiding the garbage collection

https://stackoverflow.com/a/15243682/944430

And then there is coding: use unboxed types (no GC), minimize lazy structure allocation. Keep long lived data around in packed form. Test and benchmark.

1.) What are unboxed types? I am pretty sure he is speaking about data types, something like Just x or IO y (boxed). But what about newtypes? If I understood it correctly, newtype has no overhead at all and therefore shouldn't count as a boxed type?

2.) What does he mean by Keep long lived data around in packed form.?

3.) What else can I do to prevent GC pauses?

Upvotes: 2

Views: 892

Answers (2)

user2407038
user2407038

Reputation: 14578

1 . Unboxed types are the primitives in Haskell. For example, Int is defined as: data Int = GHC.Types.I# GHC.Prim.Int# (for the GHC compiler). The trailing # symbol is used to indicate primitives (this is only convention). Primitives don't really exist in Haskell. You can't define additional primitives. When they appear in code, the compiler is responsible for translating them to 'real' function calls (functions can be primitives too) and datatypes.

Yes, newtype does not 'box' a type additionally. But you can't have a newtype containing a primitive - newtype Int2 = Int2 Int# is invalid while data Int2 = Int2 Int# is fine.

The main difference between primitive and boxed types in the context of the question you linked is how they are represented in memory. A primitive type means there are no pointers to follow. A pointer to an Int# must point to the value of the number, whereas a pointer to an Int may point to a thunk which points to a thunk ... etc. Note that this means that primitives are always strict. If you believe this will be an issue, use the UNPACK pragma, which removes any 'intermediate' boxing. That is,

data D = D (Int, Int)

is stored as a pointer (D) to a pointer (the tuple) to a block of memory containing two pointers (Ints) which each point to an actual Int#. However,

data D = D {-# UNPACK #-} !(Int, Int)

is stored as a pointer (D) to two Ints, thereby removing one level of boxing. Note the !. This indicates that that field is strict and is required for UNPACK.

2 . Any data which is going to be called with polymorphic functions should be kept packed, as unpacked data passed to polymorphic functions will be repacked anyways (introducing unnecessary overhead). The reasoning behind keeping long-lived data packed is that it is more likely to be used in an intermediate datatype or function which will require repacking, while this is easier to control with short-lived data which is only passed to a few functions before being garbage collected.

3 . In 99% of cases, you won't have issues with garbage collector pauses. In general, there aren't things you can do to guarantee the GC will not pause. The only suggestion I have is, don't try to reinvent the wheel. There are libraries designed for high-performance computations with large amounts of data (repa, vector, etc). If you try to implement it yourself, chances are, they did it better!

Upvotes: 9

Tarrasch
Tarrasch

Reputation: 10547

If you define data Int2 = Int, You could think of Int# being unboxed, plain Int being boxed and Int2 as "double boxed". Is you used newtype instead of data, it would have avoided one indirection. But Int itself is still boxed. Therefore Int2 is boxed too.

As for packed form, without going into to much details, it intuitively is similar to this kind of C-code.

struct PackedCoordinate {
   int x;
   int y;
}

struct UnpackedCoordinate {
   int *x;
   int *y;
}

I'm not sure why he suggested long lived data to be in packed form. Anyhow, it seems from the documentation I linked to that one should be careful using the {-# UNPACK #-} pragma, because if you're unlucky, GHC might need to repack it's values before function calls, making it allocating more memory than it would if it wasn't unpacked to begin with.

In order to avoid garbage collections. I think you should approach this as anything else related to profiling: Find the bottle-neck in your program and then work from there.


PS. Please comment on anything I happen to be incorrect about. :)

Upvotes: 2

Related Questions