Reputation: 352
I have a PDF textbook that contains math equations like this:
However, if I attempt a simple text extraction, I get something along the lines of:
V(r) = - 3 - - 2R R2
This is not an image; it is text, but I don't know how to preserve the way it looks and get the actual characters into a text file.
Upvotes: 0
Views: 2814
Reputation: 7048
The problem you are running into is a common one. PDF essentially doesn't care about structure: it has no notion of a column, a paragraph, a line of text, or even a word, let alone a mathematical formula with lots of special formatting.
PDF is essentially only interested in placing things on a page at a specific location, and that is exactly what it does with your formulas as well: it takes the characters and graphics the formulas need and puts them somewhere on the page. It does so without any additional information that you could use afterwards to figure out that these characters and graphics even belong to a formula, let alone to reconstruct the formula during text extraction.
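To make this concrete, here is a minimal sketch in plain Python, with no PDF library involved. The `Glyph` class, the coordinates, and the `naive_extract` helper are all made up for illustration; they only mimic what a simple extractor has to work with when a page draws y = a / (b + c) as a stacked fraction.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    text: str
    x: float  # horizontal position on the page (hypothetical units)
    y: float  # baseline height; larger means higher on the page


# Hypothetical placement of "y = a / (b + c)" drawn as a stacked fraction.
# The fraction bar would be a drawn line, not text, so it is not in this list.
glyphs = [
    Glyph("y", 10, 100),
    Glyph("=", 20, 100),
    Glyph("a", 40, 108),  # numerator, raised above the baseline
    Glyph("b", 32, 92),   # denominator, lowered below the baseline
    Glyph("+", 40, 92),
    Glyph("c", 48, 92),
]


def naive_extract(glyphs, line_tolerance=3.0):
    """Group glyphs into lines by baseline, then read each line left to right."""
    lines = []
    # Walk the glyphs from the top of the page to the bottom.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tolerance:
            lines[-1].append(g)  # close enough to the current line's baseline
        else:
            lines.append([g])    # start a new line
    ordered = [sorted(line, key=lambda g: g.x) for line in lines]
    return " ".join(" ".join(g.text for g in line) for line in ordered)


print(naive_extract(glyphs))
```

All six characters come out, but the numerator lands in front of the "y =" and nothing says that "b + c" sits under a fraction bar. That is the same kind of scrambling you see in your extracted text.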
Two additional points:
1) If you share an example of such a PDF document, we could check whether it contains any useful information that could be used to extract this formula in a more competent way, but the chance is close to zero.
2) You would also have to define what a "useful way" is from your point of view. Formulas don't translate well to plain text files, so you probably need something like MathML to store them in.
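To give an idea of what storing a formula in MathML looks like, here is a small sketch that uses Python's standard `xml.etree.ElementTree` module. The `fraction_mathml` helper and the a/b fraction are purely illustrative assumptions; this is not the formula from your textbook, and it does not extract anything from a PDF.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def fraction_mathml(numerator: str, denominator: str) -> str:
    """Build a MathML fragment for a simple fraction of two identifiers."""
    math = ET.Element("math", xmlns=MATHML_NS)
    frac = ET.SubElement(math, "mfrac")
    # The first child of <mfrac> is the numerator, the second the denominator.
    ET.SubElement(frac, "mi").text = numerator
    ET.SubElement(frac, "mi").text = denominator
    return ET.tostring(math, encoding="unicode")


print(fraction_mathml("a", "b"))
```

Unlike plain text, the markup records which part is the numerator and which is the denominator, and that is exactly the information the PDF does not give you.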
Upvotes: 2