Reputation: 693
I'm having a hard time getting a piece of code to work. I want to loop through the PDF files in a folder, extract what the tabula package identifies as tables into dataframes, and write all the tables from each PDF into one CSV file.
I looked at this post (and several others), but I still can't get it to work. The script seems to loop through the files and extract some tables, but it doesn't appear to process every file, and I can't get it to write all the dataframes into a CSV file: it only writes the last one.
This is what I have so far. Any help would be greatly appreciated, specifically, how to loop correctly through the files and to write all tables from one pdf into one csv file. I'm pretty stuck...
pdf_folder = 'C:\\PDF extract\\pdf\\'
csv_folder = 'C:\\PDF extract\\csv\\'
paths = [pdf_folder + fn for fn in os.listdir(pdf_folder) if fn.endswith('.pdf')]
for path in paths:
    listdf = tabula.read_pdf(path, encoding = 'latin1', pages = 'all', nospreadsheet = True,multiple_tables=True)
    path = path.replace('pdf', 'csv')
    for df in listdf: (df.to_csv(path, index = False))
Upvotes: 2
Views: 2385
Reputation: 131
As @Scott Hunter mentioned, you are not using csv_folder.
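To make the csv_folder point concrete, here is a minimal sketch of building the output path from csv_folder rather than with path.replace('pdf', 'csv'), which also rewrites any 'pdf' inside the file name. The helper name csv_path_for and the sample file name are made up for illustration; they are not part of your script or of tabula.

```python
import os


def csv_path_for(pdf_path, csv_folder):
    """Return the path in csv_folder that corresponds to pdf_path.

    Hypothetical helper: only the extension is swapped, so a 'pdf'
    elsewhere in the folder or file name is left untouched.
    """
    stem, _ext = os.path.splitext(os.path.basename(pdf_path))
    return os.path.join(csv_folder, stem + '.csv')


# Made-up file name that contains 'pdf' twice
sample = os.path.join('in', 'pdf_report.pdf')
safe = csv_path_for(sample, 'out')
# str.replace touches every occurrence, including the one in the name
naive = os.path.basename(sample).replace('pdf', 'csv')
```

With your folders the replace call happens to produce the right result, because the only lowercase 'pdf' occurrences are the folder name and the extension. It would go wrong as soon as a file name itself contains 'pdf'.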
Also, I think you are overwriting the created .csv files:
for df in listdf: (df.to_csv(path, index = False))
Within this loop the path variable never changes, so each to_csv call overwrites the file written by the previous one, and only the last table ends up in the CSV.
Edit: You should probably try to do something like this:
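import os
import pandas as pd
import tabula

pdf_folder = 'C:\\PDF extract\\pdf\\'
csv_folder = 'C:\\PDF extract\\csv\\'
paths = [pdf_folder + fn for fn in os.listdir(pdf_folder) if fn.endswith('.pdf')]
for path in paths:
    # With multiple_tables=True, read_pdf returns a list of DataFrames
    listdf = tabula.read_pdf(path, encoding='latin1', pages='all', nospreadsheet=True, multiple_tables=True)
    # Build the output path from csv_folder and swap only the extension
    csv_path = csv_folder + os.path.splitext(os.path.basename(path))[0] + '.csv'
    # Combine all tables from this PDF and write the file once
    pd.concat(listdf).to_csv(csv_path, index=False)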
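One caveat with pd.concat: if the tables in a PDF have different columns, they are aligned by column name and the gaps are filled with NaN. A possible alternative, sketched below, is to write the first table normally and append the others with mode='a', so each table keeps its own header row. The helper name write_tables and the sample frames are made up for illustration; in the real script the list would come from tabula.read_pdf.

```python
import os
import tempfile

import pandas as pd


def write_tables(tables, csv_path):
    """Write every DataFrame in tables to one CSV file, one after another.

    Hypothetical helper: the first table creates (or overwrites) the file,
    later tables are appended, each with its own header row.
    Returns the number of tables written.
    """
    for i, df in enumerate(tables):
        mode = 'w' if i == 0 else 'a'
        df.to_csv(csv_path, mode=mode, index=False)
    return len(tables)


# Stand-in data with deliberately different column layouts
tables = [
    pd.DataFrame({'a': [1, 2], 'b': [3, 4]}),
    pd.DataFrame({'x': ['u'], 'y': ['v'], 'z': ['w']}),
]
out_path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
written = write_tables(tables, out_path)

# Read the file back to inspect what was written
with open(out_path) as fh:
    lines = fh.read().splitlines()
```

The resulting file is a stack of tables rather than one rectangular table, so it is easy to read by eye but cannot be loaded back with a single pd.read_csv call. If all tables share the same columns, pd.concat is the simpler choice.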
Upvotes: 1