Reputation: 23
I am trying to extract raw text from a PDF file. I found the PoDoFo library, which seems able to do this job.
Based on this answer, here is what I have so far:
#include <iostream>
#include <string>
#include <podofo/podofo.h>
//using namespace PoDoFo;
int main( int argc, char* argv[] )
{
    PoDoFo::PdfMemDocument pdf("inputpdftest.pdf");
    for (int pn = 0; pn < pdf.GetPageCount(); ++pn)
    {
        std::cout << "Page: " << pn << std::endl;
        PoDoFo::PdfPage* page = pdf.GetPage(pn);
        PoDoFo::PdfContentsTokenizer tok(page);
        const char* token = NULL;
        PoDoFo::PdfVariant var;
        PoDoFo::EPdfContentsType type;
        while (tok.ReadNext(type, token, var))
        {
            if (type == PoDoFo::ePdfContentsType_Keyword)
            {
                // process type, token & var
                if (var.IsArray())
                {
                    PoDoFo::PdfArray& a = var.GetArray();
                    for (size_t i = 0; i < a.GetSize(); i++)
                    {
                        if (a[i].IsString())
                        {
                            std::string str = a[i].GetString().GetStringUtf8();
                            std::cout << str << " ";
                        }
                    }
                }
            }
        }
    }
    return 0;
}
The output is exactly the same as opening the PDF in Notepad, just garbage like:
( : ˝ ˝ - H - ( : ˝ ˇ ; 7 < ˝ ˙ ˝ ) ˆ + 0 ( : ˝ % ˆ % ˘ ˚ : ˇ ( 7 < ˝ ˙ ˝ ) ( - ˝ % ' ˝ ) - 0 ˝ ˜ % / ˚ ( ˙ ˚ : ˇ ( 7 < ˝ ˙ ˝ ) ( - ˝ % ' ˝ ) - 0 ˝ ˜ % / ˚ ( ˙ ˚ : ˇ ˆ 7 < ˝ ˙ ˝ )
That is expected, because I have not managed to convert this data to normal text. My question is how to do that.
As you can see, I process the PDF data with the GetString function: I go through each token, check whether it is an array (as used by PDF commands like TJ), and then call GetString on each string element. The answer I mentioned says nothing about how to handle the result further.
The documentation only says "Returns the strings contents". Is the result an array that I should iterate over?
The input PDF is NOT a scanned picture or image. The file always contains real text that can be highlighted, copied manually, or searched.
How can I get the text out of this data?
Upvotes: 2
Views: 2124
Reputation: 100718
The problem is that the comment
// process type, token & var
was intended to be replaced with code that actually does some processing.
The code inside the if (var.IsArray()) test should only be executed once you have determined that the current command is TJ. You also need to handle a number of other text commands.
For a better example, look at the source of the podofotxtextract tool in the PoDoFo source: https://svn.code.sf.net/p/podofo/code/podofo/trunk/tools/podofotxtextract
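To illustrate the point about checking the command first, here is a rough, library-independent sketch of the operand/operator handling. The Token struct, the helper functions kw, str, other and arr, and extractText are all hypothetical names invented for this example; Token only stands in for the (type, token, var) triple that ReadNext hands back, and none of this is PoDoFo API. It assumes operands arrive before the keyword that consumes them, and it handles only the text-showing commands Tj, TJ, ' and ".

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one tokenizer result: either a keyword
// (operator) or an operand (string, array, or anything else).
struct Token {
    enum Kind { Keyword, String, Array, Other } kind;
    std::string text;          // keyword name or string contents
    std::vector<Token> items;  // elements when kind == Array
};

// Small helpers (also hypothetical) to build tokens by hand.
Token kw(std::string s)        { return Token{Token::Keyword, std::move(s), {}}; }
Token str(std::string s)       { return Token{Token::String, std::move(s), {}}; }
Token other(std::string s)     { return Token{Token::Other, std::move(s), {}}; }
Token arr(std::vector<Token> v) { return Token{Token::Array, "", std::move(v)}; }

// Collect the strings shown by text commands inside BT ... ET blocks.
// Operands are stacked until a keyword arrives; only then do we know
// whether the stacked array or string is actually text.
std::string extractText(const std::vector<Token>& stream) {
    std::vector<Token> operands;
    std::string out;
    bool inText = false;

    for (const Token& t : stream) {
        if (t.kind != Token::Keyword) {
            operands.push_back(t);  // not a command yet, keep for later
            continue;
        }
        if (t.text == "BT") {
            inText = true;
        } else if (t.text == "ET") {
            inText = false;
        } else if (inText && !operands.empty()) {
            const Token& last = operands.back();
            if (t.text == "TJ" && last.kind == Token::Array) {
                // TJ takes an array mixing strings and kerning numbers.
                for (const Token& item : last.items)
                    if (item.kind == Token::String) out += item.text;
            } else if (t.text == "Tj" && last.kind == Token::String) {
                out += last.text;
            } else if ((t.text == "'" || t.text == "\"") &&
                       last.kind == Token::String) {
                out += '\n';  // these commands move to the next line first
                out += last.text;
            }
        }
        operands.clear();  // every keyword consumes its operands
    }
    return out;
}
```

Feeding it the tokens of BT [(Hel) -20 (lo)] TJ ( world) Tj ET collects the three strings in order. Note that in a real PDF the strings are character codes in the encoding of the current font (selected with Tf), so they usually still have to be converted through that font's encoding before they are readable, which is most likely why the raw output looks like garbage.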
Upvotes: 1