Reputation: 5053
I'm using the PDFNet library to extract objects from a PDF and then run OCR on them. I instantiate my Elements object:
public class Processor
{
    public static int Main(string[] args)
    {
        Elements pdfPageElements = new Elements(pdfPage);
        ...
The constructor (in a separate class) looks like
internal class Elements : IEnumerator<Element>, IEnumerable<Element>
{
    private readonly int _position;
    private readonly ElementReader _pdfElements;
    private Element _current;

    public Elements(Page currentPage)
    {
        _pdfElements = new ElementReader();
        _pdfElements.Begin(currentPage);
        _position = 0;
    }
    ...
After instantiating pdfPageElements, I go back to Main() and use LINQ to iterate through the collection to get the PDF objects (in this case images) that I want.
var pdfPageImages = (from e in pdfPageElements
where
(e.GetType() == Element.Type.e_inline_image ||
e.GetType() == Element.Type.e_image)
select e);
The PDFNet SDK implements the MoveNext() method as follows:
public bool MoveNext()
{
    if ((_current = _pdfElements.Next()) != null)
    {
        return true;
    }
    else
    {
        _pdfElements.Dispose();
        return false;
    }
}
pdfPageImages is instantiated nicely; Console.WriteLine(pdfPageImages.Count()); returns the right number of images for my test PDF.
But when I send pdfPageImages through a foreach loop, I get the following exception:
pdftron.Common.PDFNetException: Unknown exception.
at pdftron.PDF.ElementReader.Next()
at pdftron.Elements.MoveNext()
at System.Linq.Enumerable.WhereEnumerableIterator`1.MoveNext()
at DM_PDFProcessor.Processor.Main(String[] args)
It's probably worthwhile to note that in the PDFNet documentation it states that:
Every call to ElementReader::Next() destroys the current Element.
Therefore, an Element becomes invalid after subsequent
ElementReader::Next() operation.
However, once the element is read into the IEnumerable pdfPageImages, it should be iterable indefinitely (from my limited understanding).
Note that the elements in the collection are definitely not null. Any ideas why I keep getting the exception?
Upvotes: 1
Views: 200
Reputation: 1216
Note that
var pdfPageImages = (from e in pdfPageElements
where
(e.GetType() == Element.Type.e_inline_image ||
e.GetType() == Element.Type.e_image)
select e);
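is lazily evaluated. That is, every time pdfPageImages is enumerated, pdfPageElements is also enumerated. In your code, Count() is the first enumeration and the foreach loop is the second. So if the Elements class is built so that an instance can only be enumerated once without throwing (the MoveNext() you posted disposes the reader at the end of the first pass), you might want to cache the query result: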
var pdfPageImages = (from e in pdfPageElements
where
(e.GetType() == Element.Type.e_inline_image ||
e.GetType() == Element.Type.e_image)
select e).ToList();
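To see the effect without PDFNet, here is a minimal sketch. SingleUseSequence and LazyQueryDemo are hypothetical names invented for this illustration, not PDFNet types; the sequence just imitates a wrapper whose reader is released after the first full pass, and the exception type is chosen for the demo rather than matching what PDFNet throws.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the Elements wrapper: it hands out itself as the
// enumerator and marks its "reader" as released once the items run out.
internal sealed class SingleUseSequence : IEnumerable<int>, IEnumerator<int>
{
    private readonly Queue<int> _source;
    private bool _released;

    public SingleUseSequence(IEnumerable<int> items)
    {
        _source = new Queue<int>(items);
    }

    public int Current { get; private set; }
    object IEnumerator.Current => Current;

    public bool MoveNext()
    {
        // A second pass lands here after the first one finished.
        if (_released)
            throw new InvalidOperationException("Reader already released.");

        if (_source.Count > 0)
        {
            Current = _source.Dequeue();
            return true;
        }

        _released = true; // imitates _pdfElements.Dispose() in the question
        return false;
    }

    public void Reset() => throw new NotSupportedException();
    public void Dispose() { }
    public IEnumerator<int> GetEnumerator() => this;
    IEnumerator IEnumerable.GetEnumerator() => this;
}

public static class LazyQueryDemo
{
    // Lazy query: Count() is the first pass, foreach is the second and throws.
    public static bool LazyQueryThrowsOnSecondPass()
    {
        var source = new SingleUseSequence(new[] { 1, 2, 3, 4 });
        var evens = source.Where(n => n % 2 == 0);
        evens.Count();
        try
        {
            foreach (var n in evens) { }
            return false;
        }
        catch (InvalidOperationException)
        {
            return true;
        }
    }

    // Cached query: ToList() enumerates the source once; later passes
    // read the list and never touch the source again.
    public static bool CachedQueryCanBeReused()
    {
        var source = new SingleUseSequence(new[] { 1, 2, 3, 4 });
        var evens = source.Where(n => n % 2 == 0).ToList();
        return evens.Count == 2 && evens.Sum() == 6;
    }

    public static void Main()
    {
        Console.WriteLine(LazyQueryThrowsOnSecondPass());
        Console.WriteLine(CachedQueryCanBeReused());
    }
}
```

One caveat I can't verify from here: the documentation you quoted says each Next() call invalidates the previous Element, so a list of cached Element references may still not be safe to use afterwards. If that turns out to be the case, it may be safer to pull out whatever data you need from each element inside the query, before the ToList() call.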
Upvotes: 3