Reputation: 1534
My client needs to have a CSV with name,surname,DOB from their accounting database.
The problem is, their accounting software is "in the cloud" (hence, on someone else's computer and freely accessible by anyone in the world) and all this webapp can do is generate a very badly formatted "welcome card" PDF, like this:
hi <newline>
<lots of spaces>my name is %name% <lots of spaces> %surname%
<lots of newlines and spaces to simulate text alignment to the right>I was born in %dob%
<newpage>
So, all I can get is a 500-page PDF with this unusable content.
Is there a way to extract data from such a file?
Upvotes: 1
Views: 399
Reputation: 1534
I did it! Thanks for the hints, this is how I turned the useless PDF into a useful CSV:
cat -A
;
Upvotes: 1
Reputation: 678
It is important to know whether you have to do this multiple times or just once, on one 500-page file. I will assume just once.
In that case, get the PDF converted to XML (if at all possible) or to a text file (many converters are available, just search for one).
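As a rough sketch of that conversion step, assuming the `pdftotext` tool from poppler-utils is installed (any other converter would do as well). The function names and file names here are made up for illustration:

```python
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    """Build the pdftotext command line; -layout keeps the original spacing."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_pdf_to_text(pdf_path, txt_path):
    """Run pdftotext (assumed to be on PATH) and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils "
                           "or use another converter")
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Something like `convert_pdf_to_text("cards.pdf", "cards.txt")` would then give you one long string to work on. Keeping the layout is a judgment call: it preserves the runs of spaces, which can help tell the fields apart.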
Then it is important to know whether all 'records' are formatted the same way, so is the format: .... firstname...lastname...dob...addressline1.... (where ... is stuff you don't want)?
Are there always 'labels' or 'tags' that tell you the next thing is 'address line 1', or, if a value is missing, can you tell?
If the structure is always the same and you can tell when a value is missing from a record, then you have a fighting chance of writing regular expressions to transform it into a decent format. Otherwise it will be very hard, but you might still be able to harvest a lot (if not all) of the info.
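To illustrate the regex idea, here is a minimal sketch. It assumes the text layout from the question ("my name is", then name and surname separated by at least two spaces, then "I was born in"), and that pages are separated by form feed characters, which is what pdftotext emits. The pattern and function names are my own and will need adjusting to the real output:

```python
import csv
import io
import re

# "my name is <name><2+ spaces><surname>", then any whitespace,
# then "I was born in <dob>" up to the end of that line.
RECORD = re.compile(
    r"my name is[ \t]+(?P<name>[^\n]+?)[ \t]{2,}(?P<surname>[^\n]+?)"
    r"\s+I was born in\s+(?P<dob>[^\n\f]+)"
)


def parse_cards(text):
    """Return (records, bad_pages); bad_pages lists pages that did not match."""
    records, bad_pages = [], []
    for page_no, page in enumerate(text.split("\f"), start=1):
        if not page.strip():
            continue  # skip empty trailing page
        m = RECORD.search(page)
        if m:
            records.append(tuple(m.group(k).strip()
                                 for k in ("name", "surname", "dob")))
        else:
            bad_pages.append(page_no)
    return records, bad_pages


def to_csv(records):
    """Render the records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["name", "surname", "dob"])
    writer.writerows(records)
    return buf.getvalue()
```

Pages that do not match are reported instead of silently dropped, so you can check by hand the records where a value is missing or the layout differs.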
Upvotes: 2