Reputation: 43545
I have a class that generates DNA sequences, that are represented by long strings. This class implements the IEnumerable<string>
interface, and it can produce an infinite number of DNA sequences. Below is a simplified version of my class:
class DnaGenerator : IEnumerable<string>
{
private readonly IEnumerable<string> _enumerable;
public DnaGenerator() => _enumerable = Iterator();
private IEnumerable<string> Iterator()
{
while (true)
foreach (char c in new char[] { 'A', 'C', 'G', 'T' })
yield return new String(c, 10_000_000);
}
public IEnumerator<string> GetEnumerator() => _enumerable.GetEnumerator();
IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
This class generates the DNA sequences by using an iterator. Instead of invoking the iterator again and again, an IEnumerable<string>
instance is created during the construction and is cached as a private field. The problem is that using this class results in a sizable chunk of memory being constantly allocated, with the garbage collector being unable to recycle this chunk. Here is a minimal demonstration of this behavior:
var dnaGenerator = new DnaGenerator();
Console.WriteLine($"TotalMemory: {GC.GetTotalMemory(true):#,0} bytes");
DoWork(dnaGenerator);
GC.Collect();
Console.WriteLine($"TotalMemory: {GC.GetTotalMemory(true):#,0} bytes");
GC.KeepAlive(dnaGenerator);
static void DoWork(DnaGenerator dnaGenerator)
{
foreach (string dna in dnaGenerator.Take(5))
{
Console.WriteLine($"Processing DNA of {dna.Length:#,0} nucleotides" +
$", starting from {dna[0]}");
}
}
Output:
TotalMemory: 84,704 bytes
Processing DNA of 10,000,000 nucleotides, starting from A
Processing DNA of 10,000,000 nucleotides, starting from C
Processing DNA of 10,000,000 nucleotides, starting from G
Processing DNA of 10,000,000 nucleotides, starting from T
Processing DNA of 10,000,000 nucleotides, starting from A
TotalMemory: 20,112,680 bytes
My expectation was that all generated DNA sequences would be eligible for garbage collection, since they are not referenced by my program. The only reference that I hold is the reference to the DnaGenerator
instance itself, which is not meant to contain any sequences. This component just generates the sequences. Nevertheless, no matter how many or how few sequences my program generates, there are always around 20 MB of memory allocated after a full garbage collection.
My question is: Why is this happening? And how can I prevent this from happening?
.NET 6.0, Windows 10, 64-bit operating system, x64-based processor, Release built.
Update: The problem disappears if I replace this:
public IEnumerator<string> GetEnumerator() => _enumerable.GetEnumerator();
...with this:
public IEnumerator<string> GetEnumerator() => Iterator().GetEnumerator();
But I am not a fan of creating a new enumerable each time an enumerator is needed. My understanding is that a single IEnumerable<T>
can create many IEnumerator<T>
s. AFAIK these two interfaces are not meant to have an one-to-one relationship.
Upvotes: 10
Views: 717
Reputation: 142173
Note that 10_000_000 of chars (which are 16 bit) will take approximately 20 MB. If you will take a look at the decompilation
you will notice that yeild return
results in internal <Iterator>
class generated which in turn has a current
field to store the string (to implement IEnumerator<string>.Current
):
[CompilerGenerated]
private sealed class <Iterator>d__2 : IEnumerable<string>, IEnumerable, IEnumerator<string>, IEnumerator, IDisposable
{
...
private string <>2__current;
...
}
And Iterator
method internally will be compiled to something like this:
[IteratorStateMachine(typeof(<Iterator>d__2))]
private IEnumerable<string> Iterator()
{
return new <Iterator>d__2(-2);
}
Which leads to the current string always being stored in memory for _enumerable.GetEnumerator();
implementation (after iteration start) while DnaGenerator
instance is not GCed itself.
UPD
My understanding is that a single IEnumerable can create many IEnumerators. AFAIK these two interfaces are not meant to have an one-to-one relationship.
Yes, in case of generated for yield return
enumerable it can create multiple enumerators, but in this particular case the implementation have "one-to-one" relationship because the generated implementation is both IEnumerable
and IEnumerator
:
private sealed class <Iterator>d__2 :
IEnumerable<string>, IEnumerable,
IEnumerator<string>, IEnumerator,
IDisposable
But I am not a fan of creating a new enumerable each time an enumerator is needed.
But it is actually what is happening when you call _enumerable.GetEnumerator()
(which is obviously an implementation detail), if you check already mentioned decompilation you will see that _enumerable = Iterator()
is actually new <Iterator>d__2(-2)
and <Iterator>d__2.GetEnumerator()
looks something like this:
IEnumerator<string> IEnumerable<string>.GetEnumerator()
{
if (<>1__state == -2 && <>l__initialThreadId == Environment.CurrentManagedThreadId)
{
<>1__state = 0;
return this;
}
return new <Iterator>d__2(0);
}
So it actually should create a new iterator instance every time except the first enumeration, so your public IEnumerator<string> GetEnumerator() => Iterator().GetEnumerator();
approach is just fine.
Upvotes: 5
Reputation: 43545
@GuruStron's answer demonstrated that the problem that I've presented here was created by my shallow understanding of the C# iterators, and of how they are implemented internally. By storing an IEnumerable<string>
in my DnaGenerator
instances, I am gaining essentially nothing. When an enumerator is requested, both lines below result in allocating a single object. It's an autogenerated object with dual personality. It is both an IEnumerable<string>
, and an IEnumerator<string>
.
public IEnumerator<string> GetEnumerator() => _enumerable.GetEnumerator();
public IEnumerator<string> GetEnumerator() => Iterator().GetEnumerator();
By storing the _enumerable
in a field I am just preventing this object from getting recycled.
Nevertheless I am still searching for ways to solve this non-issue, in a way that would allow me to keep the cached _enumerable
field, without causing a memory leak, and without resorting to implementing a full fledged IEnumerable<string>
from scratch as shown in @MatthewWatson's answer. The workaround that I found is to wrap my generated DNA sequences in StrongBox<string>
wrappers:
private IEnumerable<StrongBox<string>> Iterator()
{
while (true)
foreach (char c in new char[] { 'A', 'C', 'G', 'T' })
yield return new(new String(c, 10_000_000));
}
Then I have to Unwrap
the iterator before exposing it to the external world:
private readonly IEnumerable<string> _enumerable;
public DnaGenerator() => _enumerable = Iterator().Unwrap();
Here is the Unwrap
extension method:
/// <summary>
/// Unwraps an enumerable sequence that contains values wrapped in StrongBox instances.
/// The latest StrongBox instance is emptied when the enumerator is disposed.
/// </summary>
public static IEnumerable<T> Unwrap<T>(this IEnumerable<StrongBox<T>> source)
=> new StrongBoxUnwrapper<T>(source);
private class StrongBoxUnwrapper<T> : IEnumerable<T>
{
private readonly IEnumerable<StrongBox<T>> _source;
public StrongBoxUnwrapper(IEnumerable<StrongBox<T>> source)
{
ArgumentNullException.ThrowIfNull(source);
_source = source;
}
public IEnumerator<T> GetEnumerator() => new Enumerator(_source.GetEnumerator());
IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
private class Enumerator : IEnumerator<T>
{
private readonly IEnumerator<StrongBox<T>> _source;
private StrongBox<T> _latest;
public Enumerator(IEnumerator<StrongBox<T>> source)
{
ArgumentNullException.ThrowIfNull(source);
_source = source;
}
public T Current => _source.Current.Value;
object IEnumerator.Current => Current;
public bool MoveNext()
{
var moved = _source.MoveNext();
_latest = _source.Current;
return moved;
}
public void Dispose()
{
_source.Dispose();
if (_latest is not null) _latest.Value = default;
}
public void Reset() => _source.Reset();
}
}
The trick is to keep track of the latest StrongBox<T>
that has been emitted by the enumerator, and set its Value
to default
when the enumerator is disposed.
Upvotes: 0
Reputation: 429
If memory usage (or speed) is an concern, you might (also) want to use bytes (or ints) to represent 4 nucleotides at once. Given what you shared with us, that might be the case.
Upvotes: 1
Reputation: 109577
The problem is caused by the auto-generated implementation for the code using yield
.
You can mitigate this somewhat by explicitly implementing the enumerator.
You have to fiddle it a bit by calling .Reset()
from public IEnumerator<string> GetEnumerator()
to ensure the enumeration restarts at each call:
class DnaGenerator : IEnumerable<string>
{
private readonly IEnumerator<string> _enumerable;
public DnaGenerator() => _enumerable = new IteratorImpl();
sealed class IteratorImpl : IEnumerator<string>
{
public bool MoveNext()
{
return true; // Infinite sequence.
}
public void Reset()
{
_index = 0;
}
public string Current
{
get
{
var result = new String(_data[_index], 10_000_000);
if (++_index >= _data.Length)
_index = 0;
return result;
}
}
public void Dispose()
{
// Nothing to do.
}
readonly char[] _data = { 'A', 'C', 'G', 'T' };
int _index;
object IEnumerator.Current => Current;
}
public IEnumerator<string> GetEnumerator()
{
_enumerable.Reset();
return _enumerable;
}
IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
Upvotes: 6