Reputation: 127
Expected behavior:
Read PDF, extract all table data into pandas df.
Actual behavior:
Reads the PDF fine, extracts most of the table data, and saves it to debugging.txt with fp.write(df). One column (names) usually shows only '...' when I view debugging.txt or watch the terminal print it.
About 9 times out of 10 it returns '...'. Sometimes only the first page is affected and the rest are fine; sometimes they're all OK. It seems weird.
(I may be an idiot, and it might be shortening it because it's by far the longest string, by 2-3x. But my Google-fu is failing me.)
Sample Input (Names covered for privacy):
Sample Output:
21 121 87 59 2003 ... NaN NaN NaN
22 122 86 59 2026 ... NaN NaN NaN
23 123 85 60 2038 ... NaN NaN NaN
24 124 84 60 2050 ... NaN NaN NaN
25 125 83 61 2056 ... NaN NaN NaN
26 126 82 61 2095 ... NaN NaN NaN
Code:
import os
import pandas as pd
from tabula import read_pdf

pagecount = 0
for filename in os.listdir(SPLITDIR):
    print("Working on: {}".format(filename))
    if not filename.endswith(".pdf"):
        print("I dont think {} is a PDF".format(filename))
        continue
    # Extract every table on every page of the current split file
    pagedf = read_pdf(SPLITPATH.format(pagecount), pages='all')
    #print(pagedf)
    debugextract.write(str(pagedf))
    pagedf = pd.DataFrame(pagedf)
    print(pagedf)
    pagecount += 1
Upvotes: 1
Views: 1050
Reputation: 495
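This doesn't come from tabula. The '...' comes from how the DataFrame is displayed: the pandas display settings, which also control the output in a terminal, IPython or Jupyter. The extracted data itself is intact.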
See also https://github.com/chezou/tabula-py/issues/216#issuecomment-581837621
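To illustrate, here is a minimal sketch of two ways to get untruncated output from pandas. The DataFrame is a made-up stand-in for an extracted table (the column names and values are hypothetical, not taken from the question's PDF), and the width of 1000 is an arbitrary choice, so adjust it to your data.

```python
import pandas as pd

# Hypothetical stand-in for an extracted table: many columns plus one long text column.
df = pd.DataFrame({"col{}".format(i): range(3) for i in range(30)})
df["name"] = ["A fairly long participant name {}".format(i) for i in range(3)]

# Option 1: to_string() renders every row and column.
full_text = df.to_string()

# Option 2: temporarily lift the display limits, then use str()/print() as before.
with pd.option_context(
    "display.max_columns", None,   # no limit on the number of columns shown
    "display.width", 1000,         # arbitrary wide line so columns are not wrapped
    "display.max_colwidth", None,  # do not shorten long cell values
):
    context_text = str(df)

print(full_text)
```

In the question's loop this would probably mean writing the output of to_string() to the debug file instead of str(...), applied to each DataFrame that read_pdf returns. pd.set_option("display.max_columns", None) changes the setting for the whole session rather than only inside a with block.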
Upvotes: 2