Reputation: 71

How to read tables in a PDF with Python tabula-py when table cells contain line breaks?

I tried to use the Python package tabula-py to read a table from a PDF. It seems that line breaks inside a table cell split the contents of the original cell into multiple cells.

I searched for other Python packages to solve this problem, and tabula-py seems to be the most reliable one for converting a PDF table into pandas data. However, if this problem cannot be solved, I will have to turn to an online service, which produces the Excel output I want.

from tabula import read_pdf
df=read_pdf("C:/Users/Desktop/test.pdf", pages='all')

I expected this to convert the PDF table correctly.

Upvotes: 6

Answers (3)

Tarik

Reputation: 21

I advise you to use the 'lattice' parameter; that way, a line break inside a cell is replaced by \n. Another way is to store each table in a JSON file and load it into a DataFrame, to be sure you keep the column names with their line breaks.

import tabula

# Use Tabula to extract the tables on a specific page and save each one to a JSON file
# (pdf_path is the path to your PDF file)
for i, table in enumerate(tabula.read_pdf(pdf_path, pages="85", multiple_tables=True, lattice=True)):
    table.to_json(str(i) + "_.json")

Example of loading one of the JSON files:

import pandas as pd

data_test = pd.read_json("2_.json")
data_test.head()

Output: head of the data_test table
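
Since line breaks survive as \n (or \r) characters in the extracted text, including in column names, it can help to flatten them afterwards. This is only a minimal sketch, assuming you want single-line labels; `normalize_label` is a hypothetical helper name, not part of tabula-py or pandas:

```python
import re


def normalize_label(text):
    """Collapse line breaks and repeated whitespace into single spaces."""
    # Non-string values (numbers, NaN, None) are returned unchanged.
    if not isinstance(text, str):
        return text
    return re.sub(r"\s+", " ", text).strip()


# Sample label containing \r, as tabula may produce for a wrapped header.
print(normalize_label("FDA Established Pharmacologic Class\r(EPC) Text Phrase"))
# prints: FDA Established Pharmacologic Class (EPC) Text Phrase
```

On a DataFrame, the same function could be passed to `df.rename(columns=normalize_label)` to clean the headers in one step.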

Upvotes: 0

Michael Bergstrom

Reputation: 80

Tabula no longer has 'spreadsheet' as an option. Use the 'lattice' option instead to keep line breaks from splitting a cell into new rows, like this:

import tabula

# Read pdf into DataFrame
df = tabula.read_pdf("FDA EPC Text Phrases  (updated March 2018.pdf", pages='all', lattice=True)
print(df)

Upvotes: 5

ALFAFA

Reputation: 648

You can use the 'spreadsheet' option with the value True to avoid the multiple rows of NaN values caused by line breaks.

import tabula

# Read pdf into DataFrame
df = tabula.read_pdf("FDA EPC Text Phrases  (updated March 2018.pdf", pages='all', spreadsheet=True)
print(df)
#print(df['Active Moiety Name'])
#print(df['FDA Established Pharmacologic Class\r(EPC) Text Phrase\rPLR regulations require that the following\rstatement is included in the Highlights\rIndications and Usage heading if a drug is a\rmember of an EPC [see 21 CFR\r201.57(a)(6)]: “(Drug) is a (FDA EPC Text\rPhrase) indicated for [indication(s)].” For\reach listed active moiety, the associated\rFDA EPC text phrase is included in this\rdocument. For more information about how\rFDA determines the EPC Text Phrase, see\rthe 2009 "Determining EPC for Use in the\rHighlights" guidance and 2013 "Determining\rEPC for Use in the Highlights" MAPP\r7400.13.'])

Output:

1758                                         ziconotide                  N-type calcium channel antagonist
1759                                         zidovudine  HIV nucleoside analog reverse transcriptase in...
1760                                           zileuton                           5-lipoxygenase inhibitor
1761                                        zinc cation                        copper absorption inhibitor
1762                                        ziprasidone                             atypical antipsychotic
1763                                    zoledronic acid                                     bisphosphonate
1764                          zoledronic acid anhydrous                                     bisphosphonate
1765                                       zolmitriptan     serotonin 5-HT1B/1D receptor agonist (triptan)
1766                                       zolmitriptan     serotonin 5-HT1B/1D receptor agonist (triptan)
1767                                           zolpidem           gamma-aminobutyric acid (GABA) A agonist
1768                                         zonisamide                           antiepileptic drug (AED)
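
If the extraction option does not help for a particular PDF and the NaN rows remain, they can also be merged after the fact. The following is a rough sketch, not a tabula-py feature: it assumes that a row whose first cell is empty is the continuation of the row above it. The helper names and the sample rows are made up for illustration.

```python
import math


def is_empty(cell):
    """True for None, blank strings, and float NaN (pandas' missing marker)."""
    if cell is None:
        return True
    if isinstance(cell, float) and math.isnan(cell):
        return True
    return isinstance(cell, str) and cell.strip() == ""


def merge_continuation_rows(rows, key_col=0):
    """Fold rows whose key column is empty into the previous row.

    Assumption: an empty key cell means the row is the wrapped remainder
    of the row above it, not a separate record.
    """
    merged = []
    for row in rows:
        if merged and is_empty(row[key_col]):
            previous = merged[-1]
            for i, cell in enumerate(row):
                if is_empty(cell):
                    continue
                # Append the fragment to the text already collected for this column.
                previous[i] = cell if is_empty(previous[i]) else f"{previous[i]} {cell}"
        else:
            merged.append(list(row))
    return merged


# Made-up sample: the second row is the wrapped part of the first one.
rows = [
    ["alpha", "first line"],
    [float("nan"), "second line"],
    ["beta", "single line"],
]
print(merge_continuation_rows(rows))
```

With a DataFrame, the rows could be taken from `df.values.tolist()` and the result passed back to `pd.DataFrame(..., columns=df.columns)`. The heuristic will wrongly merge records that legitimately have an empty first cell, so it is worth checking against the PDF.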

Upvotes: 0
