Reputation: 4563
The PDF in this link (http://www.lenovo.com/psref/pdf/psref450.pdf) contains a number of tables like this:
I'd like to programmatically extract the data and the structure from these tables.
Things I've tried: converting the PDF to HTML using
I was planning to convert the PDF to HTML and then parse it with BeautifulSoup.
The output could be JSON (e.g. one object per table), XML, or pretty much any format that maintains the structure.
Upvotes: 6
Views: 5522
Reputation: 1343
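import pdfplumber
import pandas as pd

filepath = r"actualFile_path"
outfile = r"destination_path"

# Append the table found on each page to a single CSV file.
with pdfplumber.open(filepath) as pdf:
    for page in pdf.pages:
        # Infer row/column boundaries from text alignment, not ruling lines.
        table = page.extract_table(table_settings={
            "vertical_strategy": "text", "horizontal_strategy": "text"})
        if not table:
            continue  # no table detected on this page
        # Assumption: the first extracted row holds the column headers.
        df = pd.DataFrame(table[1:], columns=table[0])
        df.to_csv(outfile, mode="a", index=False)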
Upvotes: 0
Reputation: 1390
You could try PDFBox. The documentation for that is here:
https://pdfbox.apache.org/1.8/cookbook/textextraction.html
Extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use that information to determine the column and row positions. You can set up text regions to determine which characters are drawn in which region. Since you know the layout of the regions is tabular, you'll be able to define tables and tell which column and row each piece of extracted text belongs to using simple algorithms.
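The grid-building step described above can be sketched independently of PDFBox. The following is a minimal Python illustration, not PDFBox code: it assumes the line segments and word positions have already been captured (for example by the overridden strokePath and a text stripper), that coordinates use a top-left origin with y increasing downward, and that all function names (`grid_edges`, `locate`, `build_table`) are hypothetical.

```python
from bisect import bisect_right


def _dedupe(values, tol):
    """Sort values and drop those within tol of the previously kept value."""
    out = []
    for v in sorted(values):
        if not out or v - out[-1] > tol:
            out.append(v)
    return out


def grid_edges(segments, tol=1.0):
    """Turn axis-aligned segments (x0, y0, x1, y1) into column and row edges."""
    xs, ys = [], []
    for x0, y0, x1, y1 in segments:
        if abs(x0 - x1) <= tol:        # vertical line -> column boundary
            xs.append((x0 + x1) / 2)
        elif abs(y0 - y1) <= tol:      # horizontal line -> row boundary
            ys.append((y0 + y1) / 2)
    return _dedupe(xs, tol), _dedupe(ys, tol)


def locate(x, y, col_edges, row_edges):
    """Return (row, col) of the cell containing (x, y), or None if outside."""
    col = bisect_right(col_edges, x) - 1
    row = bisect_right(row_edges, y) - 1
    if 0 <= col < len(col_edges) - 1 and 0 <= row < len(row_edges) - 1:
        return row, col
    return None


def build_table(segments, words, tol=1.0):
    """words: iterable of (x, y, text). Returns a list of rows of cell strings."""
    col_edges, row_edges = grid_edges(segments, tol)
    n_rows = max(len(row_edges) - 1, 0)
    n_cols = max(len(col_edges) - 1, 0)
    table = [[""] * n_cols for _ in range(n_rows)]
    # Visit words top-to-bottom, left-to-right so text within a cell stays in order.
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        cell = locate(x, y, col_edges, row_edges)
        if cell is not None:
            r, c = cell
            table[r][c] = (table[r][c] + " " + text).strip()
    return table
```

Native PDF coordinates put the origin at the bottom-left, so y values would need to be flipped before calling `build_table` if they come straight from the page content stream.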
Upvotes: 6
Reputation: 22478
Only FYI, as mine is not a publicly available tool: it sure is possible. Here is this one table in plain text form -- the spaces in between are tabs, not spaces:
2469-2TU i5-3320M 4GBx1 14.0" HD 720p 500G 7200 Intel 620528 WWAN upg Express 54 Finger BT 6 Win7 Pro64 10/12
✂ 2469-2SU i5-3210M 4GBx1 14.0" HD 720p 500G 7200 Intel 2200 WWAN upg Express 54 None None 6 Win7 Pro64 10/12
✂ 2469-2RU i3-3110M 4GBx1 14.0" HD 720p 320G 7200 Intel 2200 WWAN upg Express 54 None None 6 Win7 Pro64 10/12
2469-32U i5-3230M 4GBx1 14.0" HD 720p 320G 7200 Intel 6205 WWAN upg None Finger BT 6 Win7 Pro64 02/13
2469-2ZU i5-3230M 4GBx1 14.0" HD 720p 320G 7200 Intel 2200 WWAN upg None None None 6 Win7 Pro64 02/13
2469-2YU i5-3320M 4GBx1 14.0" HD 720p 320G 7200 Intel 6205 WWAN upg None Finger BT 6 Win7 Pro64 02/13
2469-2XU i5-3320M 4GBx1 14.0" HD 720p 320G 7200 Intel 6205 WWAN upg None None None 6 Win7 Pro64 02/13
2469-2WU i5-3320M 4GBx1 14.0" HD 720p 320G 7200 WLAN upg WWAN upg None Finger BT 6 Win7 Pro64 02/13
I second PDFBox, as it works similarly to my own hand-written utility: interrogate (x,y) positions, sort, then paste together "likely" strings and insert a tab when the horizontal space is larger than one would reasonably expect.
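A rough Python sketch of that sort-and-paste idea follows. It is not the utility mentioned above, which is not public; the glyph tuple layout `(x, y, width, char)`, the function name `glyphs_to_lines`, and the thresholds `line_tol`, `gap_factor`, and `space_factor` are all assumptions chosen for illustration, and y is assumed to increase downward.

```python
def glyphs_to_lines(glyphs, line_tol=2.0, gap_factor=1.5, space_factor=0.25):
    """Rebuild text lines from positioned glyphs given as (x, y, width, char).

    Glyphs whose y values differ by at most line_tol share a line. Within a
    line, a horizontal gap wider than gap_factor * previous glyph width
    becomes a tab, a smaller gap wider than space_factor * width becomes a
    space, and anything tighter is pasted together directly.
    """
    # Sort top-to-bottom, then left-to-right, and group glyphs into lines.
    lines = []  # each entry: [y of first glyph on the line, [glyphs]]
    for g in sorted(glyphs, key=lambda g: (g[1], g[0])):
        if lines and abs(g[1] - lines[-1][0]) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append([g[1], [g]])

    out = []
    for _, row in lines:
        row.sort(key=lambda g: g[0])
        text = row[0][3]
        for prev, cur in zip(row, row[1:]):
            gap = cur[0] - (prev[0] + prev[2])
            if gap > gap_factor * prev[2]:
                text += "\t"   # unusually wide gap: treat as a column break
            elif gap > space_factor * prev[2]:
                text += " "    # ordinary word gap
            text += cur[3]
        out.append(text)
    return out
```

Splitting each returned line on tabs then yields one list of cell values per table row; the thresholds usually need tuning per document and font size.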
I even got the little Scissors in Zapf Dingbats :)
Upvotes: 1
Reputation: 121
@alex-woolford: In general, perfect extraction of data (with or without the formatting you see in the PDF) is not always possible, though it is possible to some extent short of 100%. I say this having worked on a similar project earlier. I ran into similar issues, and some research on the Net showed that PDF is not a perfectly reversible format, i.e. it is not always possible to recover the text and layout from a PDF with 100% accuracy. Sometimes characters even get lost or transposed during extraction (using some library). This seems to be due to the very nature of the PDF format and specification: it is not a text-based format; it is a derivative of PostScript and has some odd rules about the layout of data. This is according to official PDF documents and to the sites of companies that have worked with PDF for a long time and whose products are well known.
If less than perfect accuracy is tolerable, there are some products available (though I don't know of any for Python, as of now). One is xpdf and another is PDFTextStream. I've used the former, not the latter. xpdf is a C library and also has command-line tools. PDFTextStream is a Java tool/library. It was a paid product earlier, but last I checked, it is now free for single-threaded applications, IIRC.
Even though xpdf is for C and PDFTextStream is for Java, you could call them from Python via XML-RPC or some other distributed computing / cross-language communication approach such as sockets. Some work would be involved in setting that up, of course.
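For xpdf specifically, a lighter-weight alternative to XML-RPC or sockets is to run its command-line tool as a subprocess. The sketch below assumes the `pdftotext` binary from xpdf is installed and on the PATH; it uses the `-layout`, `-f`, and `-l` options and `-` as the output file to write to stdout. The helper names (`pdftotext_command`, `extract_layout_text`, `split_columns`) are hypothetical, and splitting on runs of two or more spaces is only a heuristic that will not suit every table.

```python
import re
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None):
    """Build the pdftotext argument list; '-' sends the text to stdout."""
    cmd = ["pdftotext", "-layout"]  # -layout keeps the physical column layout
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    return cmd + [pdf_path, "-"]


def extract_layout_text(pdf_path, first_page=None, last_page=None):
    """Run pdftotext and return its output; raises if the tool fails."""
    result = subprocess.run(
        pdftotext_command(pdf_path, first_page, last_page),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def split_columns(line):
    """Heuristic: treat runs of two or more whitespace characters as column breaks."""
    return re.split(r"\s{2,}", line.strip())
```

A caller would then do something like `rows = [split_columns(l) for l in extract_layout_text(path).splitlines() if l.strip()]` and filter out the non-table lines.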
HTH.
Upvotes: 1