C++ PoDoFo - How can I convert PDF into raw TXT file?

Question

I am trying to extract raw text from a PDF file. I found the PoDoFo library, which seems able to do this job.

Based on this answer, here is what I have so far:

#include <iostream>
#include <string>
#include <podofo/podofo.h>

//using namespace PoDoFo;

int main( int argc, char* argv[] )
{
    PoDoFo::PdfMemDocument pdf("inputpdftest.pdf");
    for (int pn = 0; pn < pdf.GetPageCount(); ++pn) 
    {
        std::cout << "Page: " << pn << std::endl;
        PoDoFo::PdfPage* page = pdf.GetPage(pn);
        PoDoFo::PdfContentsTokenizer tok(page);
        const char* token = NULL;
        PoDoFo::PdfVariant var;
        PoDoFo::EPdfContentsType type;
        while (tok.ReadNext(type, token, var)) 
        {
            if (type == PoDoFo::ePdfContentsType_Keyword)
            {
                // process type, token & var
                if (var.IsArray()) 
                {
                    PoDoFo::PdfArray& a = var.GetArray();
                    for (size_t i = 0; i < a.GetSize(); i++)
                    {
                        if (a[i].IsString())
                        {
                            std::string str = a[i].GetString().GetStringUtf8();
                            std::cout << str << " ";
                        }
                    }
                }
            }
        }
    }
    return 0;
}

The output looks exactly like what I see when I open the PDF in Notepad, just garbage such as:

  ( : ˝  ˝   - H  -   ( : ˝ ˇ  ; 7  < ˝ ˙ ˝  )     ˆ + 0  ( : ˝     % ˆ % ˘ ˚ : ˇ  ( 7  < ˝ ˙ ˝  )       ( -  ˝   % ' ˝ ) - 0 ˝      ˜ % / ˚ (  ˙ ˚ : ˇ  ( 7  < ˝ ˙ ˝  )       ( -  ˝   % ' ˝ ) - 0 ˝    ˜ % / ˚ (  ˙ ˚ : ˇ  ˆ 7  < ˝ ˙ ˝  )

That is to be expected, because I have not managed to convert this data into normal text. My question is how to do that.
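One plausible reason for output like this, stated as an assumption rather than a diagnosis of this particular file: the bytes inside a PDF string are character codes that only make sense relative to the font selected at that point, so they have to be mapped through that font's encoding before they read as text. The sketch below is plain C++ with no PoDoFo calls; `CodeToText` and `decodeWithFont` are hypothetical names, and any table contents would be made up for illustration.

```cpp
#include <map>
#include <string>

// Hypothetical per-font table from one-byte character codes to UTF-8 text,
// standing in for what a font's Encoding or ToUnicode entry would provide.
using CodeToText = std::map<unsigned char, std::string>;

// Translates raw string bytes through the table. Codes the table does not
// cover become '?' so that gaps stay visible in the result.
std::string decodeWithFont(const std::string& rawBytes, const CodeToText& table)
{
    std::string text;
    for (unsigned char code : rawBytes)
    {
        auto it = table.find(code);
        text += (it != table.end()) ? it->second : std::string("?");
    }
    return text;
}
```

With a table that maps code 0x01 to "H" and 0x02 to "i", the bytes 01 02 03 decode to "Hi?". In a real extractor the table would have to come from the font selected by the most recent Tf operator on the page.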

So, as you can see, I have to process the PDF data using the GetString function. Right now I go through each token, check whether it is an array (as used with PDF operators such as TJ), and then call GetString on each string element. The answer I mentioned says nothing about how to handle the result further.
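For context, the operand of TJ is an array that mixes strings with numbers, where each number shifts the next glyph horizontally. A minimal sketch of walking such an array follows, using a hypothetical `TjElement` type instead of PoDoFo's `PdfVariant`; the -200 threshold for inserting a space is an assumed heuristic, not a value taken from the PDF specification.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for one element of a TJ operand array:
// either a string of character codes or a numeric position adjustment.
using TjElement = std::variant<std::string, double>;

// Joins the string elements of a TJ array. A strongly negative adjustment
// moves the next glyph further right, which usually indicates a word gap,
// so a space is emitted for it. Small kerning adjustments are ignored.
std::string joinTjArray(const std::vector<TjElement>& elements,
                        double spaceThreshold = -200.0)
{
    assert(spaceThreshold < 0.0); // only negative adjustments widen the gap
    std::string text;
    for (const TjElement& element : elements)
    {
        if (const std::string* s = std::get_if<std::string>(&element))
            text += *s;
        else if (std::get<double>(element) < spaceThreshold)
            text += ' ';
    }
    return text;
}
```

The joined string still consists of raw character codes, so it would need to be decoded with the current font before it is readable.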

The documentation only says "Returns the strings contents". Does that mean it is an array that I should iterate over?

The input PDF is NOT a scanned picture or image. The files always contain real text that can be highlighted, copied manually, or searched for a word.

Example PDF

I would sincerely appreciate an answer on how to get text from such data.
