stygarfield

Reputation: 127

Tabula-py returns '...' in one specific column of the df; everything else seems to work

Expected behavior:

Read PDF, extract all table data into pandas df.

Actual behavior:

Reads the PDF fine, extracts most of the table data, and saves it to debugging.txt with fp.write(df). One column (names) usually shows only '...' when I view debugging.txt or watch the terminal print it.

About 9 times out of 10 it returns '...'. Sometimes only the first page is affected and the rest are fine; sometimes they are all OK. It seems weird.

(I may be an idiot, and it might be shortening the column because it's by far the longest string, by 2-3x. But my Google-fu is failing me.)

Sample Input (Names covered for privacy):

[Image: Sample Input]

Sample Output:

21        121         87    59 2003  ...         NaN        NaN         NaN
22        122         86    59 2026  ...         NaN        NaN         NaN
23        123         85    60 2038  ...         NaN        NaN         NaN
24        124         84    60 2050  ...         NaN        NaN         NaN
25        125         83    61 2056  ...         NaN        NaN         NaN
26        126         82    61 2095  ...         NaN        NaN         NaN

Code:

pagecount = 0
for filename in os.listdir(SPLITDIR):

    print("Working on: {}".format(filename))

    if not filename.endswith(".pdf"):
        print("I dont think {} is a PDF".format(filename))
        continue

    pagedf = read_pdf(SPLITPATH.format(pagecount), pages='all')
    #print(pagedf)
    debugextract.write(str(pagedf))

    pagedf = pd.DataFrame(pagedf)
    print(pagedf)

    pagecount += 1

Upvotes: 1

Views: 1050

Answers (1)

chezou

Reputation: 495

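This doesn't come from tabula-py. The '...' comes from the display settings in effect when the DataFrame is printed (for example in IPython or Jupyter); the extracted data itself is still in the DataFrame.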

See also https://github.com/chezou/tabula-py/issues/216#issuecomment-581837621
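As a rough sketch of how the display can be widened with pandas: the DataFrame below is a made-up stand-in with hypothetical column names, not the asker's data, and it assumes the '...' is pandas eliding columns when the frame is converted to text.

```python
import pandas as pd

# Hypothetical wide frame standing in for the table that tabula-py extracted.
df = pd.DataFrame({f"col{i}": range(100 + i, 103 + i) for i in range(30)})

# Option 1: to_string() renders every row and column, independent of the
# display options, so it is a reasonable choice when writing to a debug file.
full_text = df.to_string()

# Option 2: temporarily lift the column and width limits that str(df) and
# print(df) respect. None means "no limit" for display.max_columns.
with pd.option_context("display.max_columns", None, "display.width", None):
    widened_text = str(df)

print(full_text)
```

In the question's loop, writing `pagedf.to_string()` instead of `str(pagedf)` would be the equivalent change, provided `pagedf` is a single DataFrame at that point.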

Upvotes: 2
