Reputation: 2292

Text extraction from table cells

I have a PDF that contains a table with many cells (>100). I know the exact position (x, y) and dimensions (w, h) of every cell of the table.
I need to extract the text from each cell using iTextSharp. Using PdfReaderContentParser + FilteredTextRenderListener (with code like this: http://itextpdf.com/examples/iia.php?id=279 ) I can extract the text, but I have to run the whole procedure once per cell. My PDF has many cells, so the program takes too long to run. Is there a way to extract text from a list of rectangles? I need to know the text of each rectangle. I'm looking for something like PDFTextStripperByArea in PDFBox (you can define as many regions as you need and then get the text using .getTextForRegion("region-name") ).

Upvotes: 0

Answers (2)

Shaun

Reputation: 207

Here's my take on how to extract text from a table-like structure in a PDF using iTextSharp. It returns a collection of rows, and each row contains a collection of interpreted columns. It works only on the premise that the gap between one column and the next is greater than the average width of a single character. I also added an option to check for wrapped text within a virtual column. Your mileage may vary.

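        using System;
        using System.Collections.Generic;
        using System.Text;
        using iTextSharp.text;
        using iTextSharp.text.pdf;
        using iTextSharp.text.pdf.parser;

        using (PdfReader pdfReader = new PdfReader(stream))
        {
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // use a fresh strategy per page so chunks from different pages are not mixed
                TableExtractionStrategy tableExtractionStrategy = new TableExtractionStrategy();
                // parsing the page feeds every text chunk to the strategy; the plain text itself is not used here
                string pageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, tableExtractionStrategy);
                var table = tableExtractionStrategy.GetTable();
            }
        }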



        public class TableExtractionStrategy : LocationTextExtractionStrategy
        {
            public float NextCharacterThreshold { get; set; } = 1;
            public int NextLineLookAheadDepth { get; set; } = 500;
            public bool AccomodateWordWrapping { get; set; } = true;

            private List<TableTextChunk> Chunks { get; set; } = new List<TableTextChunk>();

            public override void RenderText(TextRenderInfo renderInfo)
            {
                base.RenderText(renderInfo);
                string text = renderInfo.GetText();
                Vector bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
                Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
                Rectangle rectangle = new Rectangle(bottomLeft[Vector.I1], bottomLeft[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
                Chunks.Add(new TableTextChunk(rectangle, text));
            }

            public List<List<string>> GetTable()
            {
                List<List<string>> lines = new List<List<string>>();
                List<string> currentLine = new List<string>();

                float? previousBottom = null;
                float? previousRight = null;

                StringBuilder currentString = new StringBuilder();

                // iterate through all chunks and evaluate 
                for (int i = 0; i < Chunks.Count; i++)
                {
                    TableTextChunk chunk = Chunks[i];

                    // determine if we are processing the same row based on defined space between subsequent chunks
                    if (previousBottom.HasValue && previousBottom == chunk.Rectangle.Bottom)
                    {
                        if (chunk.Rectangle.Left - previousRight > NextCharacterThreshold)
                        {
                            currentLine.Add(currentString.ToString());
                            currentString.Clear();
                        }
                        currentString.Append(chunk.Text);
                        previousRight = chunk.Rectangle.Right;
                    }
                    else
                    {
                        // if we are processing a new line let's check to see if this could be word wrapping behavior
                        bool isNewLine = true;
                        if (AccomodateWordWrapping)
                        {
                            int readAheadDepth = Math.Min(i + NextLineLookAheadDepth, Chunks.Count);
                            if (previousBottom.HasValue)
                                for (int j = i; j < readAheadDepth; j++)
                                {
                                    if (previousBottom == Chunks[j].Rectangle.Bottom)
                                    {
                                        isNewLine = false;
                                        break;
                                    }
                                }
                        }

                        // if the text was not word wrapped let's treat this as a new table row
                        if (isNewLine)
                        {
                            if (currentString.Length > 0)
                                currentLine.Add(currentString.ToString());
                            currentString.Clear();

                            previousBottom = chunk.Rectangle.Bottom;
                            previousRight = chunk.Rectangle.Right;
                            currentString.Append(chunk.Text);

                            if (currentLine.Count > 0)
                                lines.Add(currentLine);

                            currentLine = new List<string>();
                        }
                        else
                        {
                            if (chunk.Rectangle.Left - previousRight > NextCharacterThreshold)
                            {
                                currentLine.Add(currentString.ToString());
                                currentString.Clear();
                            }
                            currentString.Append(chunk.Text);
                            previousRight = chunk.Rectangle.Right;

                        }
                    }
                }

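                // flush the last cell and the last row, which the loop above never emits
                if (currentString.Length > 0)
                    currentLine.Add(currentString.ToString());
                if (currentLine.Count > 0)
                    lines.Add(currentLine);

                return lines;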
            }

            private struct TableTextChunk
            {
                public Rectangle Rectangle;
                public string Text;

                public TableTextChunk(Rectangle rect, string text)
                {
                    Rectangle = rect;
                    Text = text;
                }

                public override string ToString()
                {
                    return Text + " (" + Rectangle.Left + ", " + Rectangle.Bottom + ")";
                }
            }
        }
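The column-splitting premise can be shown in isolation. The following is only a minimal Java sketch of that one idea, not a port of the class above: PositionedChunk and GapColumnSplitter are hypothetical names invented for illustration, not iText or iTextSharp types, and the row is assumed to be already sorted left to right.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a positioned text chunk; not an iText type. */
class PositionedChunk {
    final float left;
    final float right;
    final String text;

    PositionedChunk(float left, float right, String text) {
        this.left = left;
        this.right = right;
        this.text = text;
    }
}

class GapColumnSplitter {
    /**
     * Splits the chunks of one row (assumed sorted left to right) into cells.
     * A new cell starts whenever the horizontal gap between two neighbouring
     * chunks exceeds gapThreshold; otherwise the text is appended to the
     * current cell.
     */
    static List<String> splitRow(List<PositionedChunk> row, float gapThreshold) {
        List<String> cells = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Float previousRight = null;
        for (PositionedChunk chunk : row) {
            if (previousRight != null && chunk.left - previousRight > gapThreshold) {
                cells.add(current.toString());
                current.setLength(0);
            }
            current.append(chunk.text);
            previousRight = chunk.right;
        }
        // emit the last cell, which no later gap will flush
        if (current.length() > 0) {
            cells.add(current.toString());
        }
        return cells;
    }
}
```

With a threshold of 1, two chunks that nearly touch end up in one cell, while a chunk that starts far to the right opens a new one.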

Upvotes: 1

mkl

Reputation: 96009

This option is not included out of the box in the iTextSharp distribution, but it is easy to implement. In the following I use the iText (Java) class, interface, and method names because I am more at home with Java. They should be easy to translate into iTextSharp (C#) names.

If you use the LocationTextExtractionStrategy, you can use its a posteriori TextChunkFilter mechanism instead of the a priori FilteredRenderListener mechanism used in the sample you linked to. This mechanism was introduced in version 5.3.3.

For this you first parse the whole page content using the LocationTextExtractionStrategy without any FilteredRenderListener filtering applied. This makes the strategy object collect a TextChunk object for every PDF text object on the page, each containing the associated baseline segment.

Then you call the strategy's getResultantText overload with a TextChunkFilter argument (instead of the regular no-argument overload):

public String getResultantText(TextChunkFilter chunkFilter)

You call it with a different TextChunkFilter instance for each table cell. You have to implement this filter interface which is not too difficult as it only defines one method:

public static interface TextChunkFilter
{
    /**
     * @param textChunk the chunk to check
     * @return true if the chunk should be allowed
     */
    public boolean accept(TextChunk textChunk);
}

So the accept method of the filter for a given cell must test whether the text chunk in question is inside your cell.

(Instead of separate instances for each cell you can of course also create one instance whose parameters, i.e. cell coordinates, can be changed between getResultantText calls.)
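As a rough illustration of such a filter, here is a self-contained Java sketch. To keep it runnable without iText on the classpath it uses stand-ins: BaselineChunk, BaselineChunkFilter, CellFilter, and CellTextDemo are hypothetical names, not iText classes. In real code the filter would implement TextChunkFilter and read the baseline from the TextChunk; how those coordinates are exposed depends on the iText version. The sketch also assumes that (x, y) is the lower-left corner of the cell in the same coordinate system as the chunks.

```java
import java.util.List;

/** Hypothetical stand-in for a parsed text chunk: baseline start and end plus the text. */
class BaselineChunk {
    final float startX, startY, endX, endY;
    final String text;

    BaselineChunk(float startX, float startY, float endX, float endY, String text) {
        this.startX = startX;
        this.startY = startY;
        this.endX = endX;
        this.endY = endY;
        this.text = text;
    }
}

/** Mirrors the single-method shape of the TextChunkFilter interface. */
interface BaselineChunkFilter {
    boolean accept(BaselineChunk chunk);
}

/** Accepts a chunk when both ends of its baseline lie inside the cell rectangle. */
class CellFilter implements BaselineChunkFilter {
    private final float x, y, w, h;

    // Assumption: (x, y) is the lower-left corner of the cell.
    CellFilter(float x, float y, float w, float h) {
        this.x = x;
        this.y = y;
        this.w = w;
        this.h = h;
    }

    private boolean contains(float px, float py) {
        return px >= x && px <= x + w && py >= y && py <= y + h;
    }

    @Override
    public boolean accept(BaselineChunk chunk) {
        return contains(chunk.startX, chunk.startY) && contains(chunk.endX, chunk.endY);
    }
}

class CellTextDemo {
    /** Concatenates, in list order, the text of every chunk the filter accepts. */
    static String textFor(List<BaselineChunk> chunks, BaselineChunkFilter filter) {
        StringBuilder sb = new StringBuilder();
        for (BaselineChunk chunk : chunks) {
            if (filter.accept(chunk)) {
                sb.append(chunk.text);
            }
        }
        return sb.toString();
    }
}
```

Requiring both baseline ends to be inside the cell is a design choice: text that spills over a cell border is then rejected rather than attributed to the wrong cell. Testing only the start point would be more lenient.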

PS: As mentioned by the OP, this TextChunkFilter has not yet been ported to iTextSharp. It should not be hard to do so, though, only one small interface and one method to add to the strategy.

PPS: In a comment sschuberth asked

Do you then still call PdfTextExtractor.getTextFromPage() when using getResultantText(), or does it somehow replace that call? If so, how to you then specify the page to extract to?

Actually PdfTextExtractor.getTextFromPage() internally already uses the no-argument getResultantText() overload:

public static String getTextFromPage(PdfReader reader, int pageNumber, TextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators) throws IOException
{
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    return parser.processContent(pageNumber, strategy, additionalContentOperators).getResultantText();
}

To make use of a TextChunkFilter you could simply build a similar convenience method, e.g.

public static String getTextFromPage(PdfReader reader, int pageNumber, LocationTextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators, TextChunkFilter chunkFilter) throws IOException
{
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    return parser.processContent(pageNumber, strategy, additionalContentOperators).getResultantText(chunkFilter);
}

In the context at hand, though, in which we want to parse the page content only once and apply multiple filters, one for each cell, we might generalize this to:

public static List<String> getTextFromPage(PdfReader reader, int pageNumber, LocationTextExtractionStrategy strategy, Map<String, ContentOperator> additionalContentOperators, Iterable<TextChunkFilter> chunkFilters) throws IOException
{
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    parser.processContent(pageNumber, strategy, additionalContentOperators);

    List<String> result = new ArrayList<>();
    for (TextChunkFilter chunkFilter : chunkFilters)
    {
        result.add(strategy.getResultantText(chunkFilter));
    }
    return result;
}

(You can make this look fancier by using Java 8 collection streaming instead of the old-fashioned for loop.)
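A streaming variant might look like the following sketch. It is deliberately generic so that it compiles without iText: F stands in for TextChunkFilter, and the extractText function stands in for strategy::getResultantText once the page has been parsed. FilterStreaming and textPerFilter are hypothetical names chosen for illustration.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

class FilterStreaming {
    /**
     * Applies extractText once per filter and collects the results in
     * iteration order. An Iterable has no stream() method of its own, so
     * StreamSupport builds a sequential stream from its spliterator.
     */
    static <F> List<String> textPerFilter(Iterable<F> filters, Function<F, String> extractText) {
        return StreamSupport.stream(filters.spliterator(), false)
                .map(extractText)
                .collect(Collectors.toList());
    }
}
```

In the method above, the loop body would then reduce to a single call along the lines of textPerFilter(chunkFilters, strategy::getResultantText), placed after the processContent call.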

Upvotes: 3
