pseudocode

Reputation: 219

iTextSharp How to read Table in PDF file

I am converting a PDF to text. Plain text extracts correctly, but text inside tables comes out scrambled. I know PDF has no native table structure, but I think there must be a way to extract the cell contents correctly. For example:

I want the extracted text to look like this:

> This is first example.

> This is second example.

But when I convert the PDF to text, the data looks like this:

> This is This is

> first example. second example.

How can I get the values correctly?

--EDIT:

Here is how I convert the PDF to text:

    OpenFileDialog ofd = new OpenFileDialog();
    string filepath;
    ofd.Filter = "PDF Files(*.PDF)|*.PDF|All Files(*.*)|*.*";

    if (ofd.ShowDialog() == DialogResult.OK)
    {
        filepath = ofd.FileName;

        string strText = string.Empty;
        try
        {
            PdfReader reader = new PdfReader(filepath);

            // Pages are 1-based, so the loop must include NumberOfPages itself.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
                string s = PdfTextExtractor.GetTextFromPage(reader, page, its);

                // Round-trips the text through the system default code page;
                // characters outside that code page may not survive this step.
                s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
                strText += s;
            }
            reader.Close();
        }
        catch (Exception ex)
        {
            MessageBox.Show(ex.Message);
        }
    }

Upvotes: 3

Views: 20249

Answers (1)

mkl

Reputation: 95918

To make my comment an actual answer...

You use the LocationTextExtractionStrategy for text extraction:

ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);

This strategy arranges all text it finds into left-to-right lines from top to bottom (it also takes the angle of the text line into account). Text from neighboring cells that sits at the same height is therefore merged into a single line, so this strategy is clearly not what you need to extract text from tables whose cells contain multi-line content.
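
To illustrate why location-based ordering interleaves the two cells from the question, here is a minimal sketch that does not use iTextSharp at all. `Chunk`, `LocationOrderDemo`, and the coordinates are hypothetical stand-ins chosen for this example, and the grouping is a simplification of what the real strategy does (it compares baselines with a tolerance and considers text direction).

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a positioned piece of text; not an iTextSharp type.
public sealed class Chunk
{
    public string Text { get; }
    public float X { get; }
    public float Y { get; }
    public Chunk(string text, float x, float y) { Text = text; X = x; Y = y; }
}

public static class LocationOrderDemo
{
    // Two table cells side by side, each holding two lines of text.
    // The array order mimics drawing order: all of cell 1, then all of cell 2.
    public static Chunk[] Sample()
    {
        return new[]
        {
            new Chunk("This is", 50f, 700f),
            new Chunk("first example.", 50f, 685f),
            new Chunk("This is", 300f, 700f),
            new Chunk("second example.", 300f, 685f),
        };
    }

    // Simplified location-based ordering: group chunks that share a y value,
    // read the groups top to bottom (PDF y grows upward), each left to right.
    public static string ExtractByLocation(IEnumerable<Chunk> chunks)
    {
        return string.Join("\n", chunks
            .GroupBy(c => c.Y)
            .OrderByDescending(g => g.Key)
            .Select(g => string.Join(" ", g.OrderBy(c => c.X).Select(c => c.Text))));
    }

    // Drawing-order extraction: keep the chunks exactly as they were emitted.
    public static string ExtractByDrawingOrder(IEnumerable<Chunk> chunks)
    {
        return string.Join("\n", chunks.Select(c => c.Text));
    }
}
```

With the sample data, `ExtractByLocation` yields the interleaved lines shown in the question ("This is This is" followed by "first example. second example."), while `ExtractByDrawingOrder` keeps each cell's lines together.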

Depending on the document in question, there are different approaches one can take:

  • Use the iText SimpleTextExtractionStrategy if the text drawing operations in the document are already in the order one wants for text extraction.
  • Use a custom text extraction strategy that makes use of tagging information if the document's tables are properly tagged.
  • Use a complex custom text extraction strategy that tries to take hints from text arrangement, line paths, or background colors to guess the table cell structure and extract the text cell by cell.

In this case, the OP commented that replacing LocationTextExtractionStrategy with SimpleTextExtractionStrategy worked.
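
A minimal sketch of that swap, assuming iTextSharp 5.x and a document whose content stream already draws each cell's text in reading order. The class and method names (`PdfTextReader`, `ExtractInDrawingOrder`) are made up for this example; the iTextSharp calls are the same ones used in the question, with only the strategy changed.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextReader
{
    // Hypothetical helper: extracts the text of every page in the order
    // in which the PDF content stream draws it.
    public static string ExtractInDrawingOrder(string filepath)
    {
        var text = new StringBuilder();
        PdfReader reader = new PdfReader(filepath);
        try
        {
            // Page numbers are 1-based and NumberOfPages is inclusive.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page, so no text is carried over between pages.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            // Release the file even if extraction throws.
            reader.Close();
        }
        return text.ToString();
    }
}
```

This only helps when the producer of the PDF wrote the cells one after another; if the drawing order is arbitrary, one of the custom strategies listed above is still needed.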

Upvotes: 5
