How to read table?

Question

I have a timetable in a PDF file.

            (1)     (2)     (3)
            09:00   10:30   11:30            
Monday      12C     11B     10A
Tuesday     10K     10K     9A
Wednesday           7A
Thursday    7B      7B
Friday      6A              11B

I am reading all text using iTextSharp.

    using System.Text;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    private static string ReadFile(string path)
    {
        using (var reader = new PdfReader(path))
        {
            var text = new StringBuilder();

            for (var i = 1; i <= reader.NumberOfPages; i++)
                text.Append(PdfTextExtractor.GetTextFromPage(reader, i));

            return text.ToString();
        }
    }

The extracted text lines look like this:

(1) (2) (3) 
09:00 10:30 11:30
12C 11B 10A
Monday
10K 10K 9A
Tuesday
7A
Wednesday
7B 7B
Thursday
6A  11B
Friday

So I cannot tell which class is at which time. For example, Wednesday has class 7A, but at which time (09:00, 10:30 or 11:30)? If the output contained a whitespace character ( ) for each empty cell, I could work it out:

(1) (2) (3) 
09:00 10:30 11:30
12C 11B 10A
Monday
10K 10K 9A
Tuesday
  7A  
Wednesday
7B 7B  
Thursday
6A   11B
Friday

Is this possible using iTextSharp?

Joris Schellekens · Accepted Answer

This is not possible in the general case.

If your PDF document is not tagged, the document itself does not contain structure information. Or to put it simply, the document does not know which parts are tables, or table rows, or even paragraphs.

Extracting structure information from an untagged PDF document is hard, if not impossible, in the general case.

Using pdf2Data, you can achieve this. The caveat is that you have to define the template up front. So you'd need to tell the software where it can expect a table.

You can have a look at SimpleTextExtractionStrategy in iText. It essentially processes all rendering information, and decides when to concatenate text to the existing buffer.

At some point in the code it decides that if the buffer already ends with whitespace, no more whitespace should be appended.

I would suggest you create your own implementation of SimpleTextExtractionStrategy that overrides this behaviour and always inserts whitespace.
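As a rough illustration of that idea, here is a library-free sketch of the joining logic such a custom strategy might use. It assumes you have already collected, for one line of text, each chunk's text and the x-coordinates where it starts and ends. The `Chunk` class, `GapPreservingJoiner`, `JoinLine`, `lineStartX` and `spaceWidth` are all hypothetical names made up for this example, not part of iTextSharp. Instead of collapsing gaps, it emits one space per space-width of horizontal distance, so an empty cell becomes a visible run of spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for what a strategy would record per rendered chunk:
// the chunk's text plus the x-coordinates where it starts and ends.
public sealed class Chunk
{
    public string Text { get; }
    public float StartX { get; }
    public float EndX { get; }

    public Chunk(string text, float startX, float endX)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
    }
}

public static class GapPreservingJoiner
{
    // Joins the chunks of one text line (assumed sorted left to right),
    // emitting one space per spaceWidth of horizontal gap, so an empty
    // table cell shows up as a run of spaces instead of vanishing.
    public static string JoinLine(IEnumerable<Chunk> chunks, float lineStartX, float spaceWidth)
    {
        var line = new StringBuilder();
        var cursorX = lineStartX;

        foreach (var chunk in chunks)
        {
            // Distance between the end of the previous chunk and this one.
            var gap = chunk.StartX - cursorX;
            var spaces = Math.Max(0, (int)Math.Round(gap / spaceWidth));

            line.Append(' ', spaces);
            line.Append(chunk.Text);
            cursorX = chunk.EndX;
        }

        return line.ToString();
    }
}
```

With made-up coordinates, a Wednesday row containing only `new Chunk("7A", 100f, 112f)`, joined with `lineStartX = 50f` and `spaceWidth = 5f`, comes out as ten spaces followed by `7A`, so the class can be matched to the second column. In a real strategy the coordinates and the space width would have to come from the rendering information iText passes in (in iTextSharp 5 this is, as far as I know, `TextRenderInfo`), and this only works if the table columns sit at consistent x-positions.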
