Magnetic_dud

Reputation: 1534

How can I extract a table from a badly formatted PDF?

My client needs to have a CSV with name,surname,DOB from their accounting database.

The problem is that their accounting software is "in the cloud" (hence, on someone else's computer and freely accessible to anyone in the world), and all this webapp can do is generate a very badly formatted "welcome card" PDF, like this:

hi <newline>
<lots of spaces>my name is %name% <lots of spaces> %surname%
<lots of newlines and spaces to simulate text alignment to the right>I was born in %dob%
<newpage>

So all I can get is a 500-page PDF with this unusable content.

Is there a way to extract data from such a file?

Upvotes: 1

Views: 399

Answers (2)

Magnetic_dud

Reputation: 1534

I did it! Thanks for the hints. This is how I turned the useless PDF into a useful CSV:

  1. I converted the PDF to TXT using cloudconvert.com
  2. I inspected the resulting file with cat -A to reveal the non-printing characters
  3. I noticed that there was a newline right before every useful piece of data
  4. I noticed that every page ended with a FORM FEED character
  5. I replaced every newline character with a ;
  6. I replaced every FORM FEED character with a newline character
  7. I imported the newly made CSV into LibreOffice and deleted the useless columns
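
For anyone who would rather script steps 5 and 6 than do them in an editor, here is a minimal sketch in Python. It assumes the converted text is already loaded into a string; the function name text_to_csv is made up for illustration, and normalising Windows line endings first is an extra precaution that was not part of my original steps.

```python
def text_to_csv(text: str) -> str:
    """Turn form-feed-separated pages into one ';'-separated line per page."""
    # Extra precaution (an assumption, not in the original steps):
    # treat Windows "\r\n" line endings as plain newlines.
    text = text.replace("\r\n", "\n")
    # Step 5: every newline becomes a field separator.
    text = text.replace("\n", ";")
    # Step 6: every FORM FEED (end of page) becomes the end of a CSV row.
    return text.replace("\f", "\n")


if __name__ == "__main__":
    sample = "hi\n   my name is John    Smith\n\n   I was born in 1980-01-01\n\f"
    print(text_to_csv(sample), end="")
```

The result still contains the padding spaces and the fixed words ("hi", "my name is"), which is why the clean-up in LibreOffice in step 7 is still needed.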

Upvotes: 1

PaulG

Reputation: 678

It is important to know whether you have to do this multiple times or just once, for a single 500-page file. I will assume just once.

In that case, get the PDF converted to XML (if at all possible) or to a text file (many converters are available; just google).

Then it is important to know whether all 'records' are formatted the same way. For example, is the format: .... firstname...lastname...dob...addressline1.... (where ... is stuff you don't want)?

Are there always 'labels' or 'tags' that tell you the next thing is 'address line 1'? And if a value is missing, can you tell?

If the structure is always the same and you can tell when a value is missing from a record, then you have a fighting chance of writing regular expressions to transform it into a decent format. Otherwise it will be very hard, but you might still be able to harvest a lot of (if not all) the info.
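
As a rough illustration of the regex approach, here is a sketch in Python. It assumes the converted text looks like the template in the question: pages separated by form feeds, a "my name is" line with the name and surname separated by at least two spaces or tabs, and an "I was born in" line. The patterns and the names extract_records and to_csv are made up for this example and would need adjusting to the real file. Pages that do not match are reported rather than guessed at, so you can tell when a value is missing.

```python
import csv
import io
import re

# Name and surname are assumed to be separated by two or more spaces/tabs.
NAME_LINE = re.compile(r"my name is[ \t]+(\S.*?)[ \t]{2,}(\S.*)")
DOB_LINE = re.compile(r"I was born in[ \t]+(\S.*)")


def extract_records(text):
    """Return (records, skipped): matched tuples and 1-based unmatched page numbers."""
    records, skipped = [], []
    for number, page in enumerate(text.split("\f"), start=1):
        if not page.strip():
            continue  # empty chunk, e.g. after the final form feed
        name = NAME_LINE.search(page)
        dob = DOB_LINE.search(page)
        if name and dob:
            records.append((name.group(1).strip(),
                            name.group(2).strip(),
                            dob.group(1).strip()))
        else:
            skipped.append(number)  # a value is missing or the layout differs
    return records, skipped


def to_csv(records):
    """Render the records as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "surname", "dob"])
    writer.writerows(records)
    return out.getvalue()
```

Calling extract_records on the whole converted text and checking the skipped list first should show quickly whether all 500 pages really share one format.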

Upvotes: 2
