Drakonno

Reputation: 23

C++ PoDoFo - How can I convert PDF into raw TXT file?

I am trying to extract raw text from a PDF file. I found the PoDoFo library, which seems to be able to do the job.

Based on this answer, here is what I have so far:

#include <iostream>
#include <string>
#include <podofo/podofo.h>

//using namespace PoDoFo;

int main( int argc, char* argv[] )
{
    PoDoFo::PdfMemDocument pdf("inputpdftest.pdf");
    for (int pn = 0; pn < pdf.GetPageCount(); ++pn) 
    {
        std::cout << "Page: " << pn << std::endl;
        PoDoFo::PdfPage* page = pdf.GetPage(pn);
        PoDoFo::PdfContentsTokenizer tok(page);
        const char* token = NULL;
        PoDoFo::PdfVariant var;
        PoDoFo::EPdfContentsType type;
        while (tok.ReadNext(type, token, var)) 
        {
            if (type == PoDoFo::ePdfContentsType_Keyword)
            {
                // process type, token & var
                if (var.IsArray()) 
                {
                    PoDoFo::PdfArray& a = var.GetArray();
                    for (size_t i = 0; i < a.GetSize(); i++)
                    {
                        if (a[i].IsString())
                        {
                            std::string str = a[i].GetString().GetStringUtf8();
                            std::cout << str << " ";
                        }
                    }
                }
            }
        }
    }
    return 0;
}

The output is exactly what I see when opening the PDF in Notepad, just garbage like:

  ( : ˝  ˝   - H  -   ( : ˝ ˇ  ; 7  < ˝ ˙ ˝  )     ˆ + 0  ( : ˝     % ˆ % ˘ ˚ : ˇ  ( 7  < ˝ ˙ ˝  )       ( -  ˝   % ' ˝ ) - 0 ˝      ˜ % / ˚ (  ˙ ˚ : ˇ  ( 7  < ˝ ˙ ˝  )       ( -  ˝   % ' ˝ ) - 0 ˝    ˜ % / ˚ (  ˙ ˚ : ˇ  ˆ 7  < ˝ ˙ ˝  )    

That is to be expected, because I have not managed to convert this data into normal text. What I am asking is how to do that.

As you can see, I have to process the PDF data using the GetString function. Right now I go through each token, check whether it is an array (and contains PDF commands like TJ etc.), and then call GetString on each such element. The answer I mentioned says nothing about how to handle this further.

The documentation says "Returns the strings contents". Does that mean it is an array and I should iterate over it?

The input PDF is NOT a scanned picture or image. The file always contains some text that can be highlighted and copied manually, or searched for a word.

Example PDF

I would sincerely appreciate an answer on how to get the text out of such data.

Upvotes: 2

Views: 2124

Answers (1)

Ferruccio

Reputation: 100718

The problem is that the comment

// process type, token & var

was intended to be replaced with code that actually does some processing. The code inside the if (var.IsArray()) test should only be executed once you've determined that the current command is TJ. You also still need to process a number of other text commands.
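To illustrate the dispatch logic only, here is a minimal, library-independent sketch. PDF content streams are postfix: operands come first, then the operator that consumes them. Token, op, operand and extractShownText are hypothetical names made up for this example and are not part of PoDoFo; a Token simply stands in for what one ReadNext call gives you (a keyword or a variant). The sketch assumes that only the text-showing operators Tj, TJ, ' and " matter, and only between BT and ET.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one ReadNext() result: an operator keyword or an operand.
struct Token {
    bool isKeyword;                    // true for operators such as BT, Tf, Tj, TJ
    std::string keyword;               // operator name when isKeyword is true
    std::vector<std::string> strings;  // string parts of an operand; empty for numbers and names
};

// Hypothetical helpers for building a token stream by hand.
Token op(const std::string& name) { return Token{true, name, {}}; }
Token operand(const std::vector<std::string>& parts) { return Token{false, "", parts}; }

// Concatenates the strings shown by Tj, TJ, ' and " inside BT/ET text blocks.
std::string extractShownText(const std::vector<Token>& tokens) {
    std::string out;
    std::vector<Token> operands;  // operands collected since the previous operator
    bool inTextBlock = false;

    for (const Token& t : tokens) {
        if (!t.isKeyword) {
            operands.push_back(t);  // remember operands until their operator arrives
            continue;
        }
        if (t.keyword == "BT") {
            inTextBlock = true;
        } else if (t.keyword == "ET") {
            inTextBlock = false;
        } else if (inTextBlock && !operands.empty() &&
                   (t.keyword == "Tj" || t.keyword == "TJ" ||
                    t.keyword == "'" || t.keyword == "\"")) {
            // For all four operators the string (or array of strings) is the last operand.
            for (const std::string& s : operands.back().strings) {
                out += s;
            }
        }
        operands.clear();  // operands belong to exactly one operator
    }
    return out;
}
```

Note that this only decides which strings to pick up. The bytes in those strings are character codes in the encoding of the font selected by the last Tf command, so to get readable text you also have to track the current font and map each string through its encoding. That step is not shown here, and skipping it is a likely reason for output like yours.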

For a better example, look at the source of the podofotxtextract tool in the PoDoFo source: https://svn.code.sf.net/p/podofo/code/podofo/trunk/tools/podofotxtextract

Upvotes: 1
