Reputation: 212
I want to take a PDF file as input and produce a CSV file as output, so that all the textual data in the PDF ends up in the CSV. I don't understand how to do this. I've tried but couldn't get it to work, so any help would be appreciated.
I used a library called tabula-py, which converts PDF to CSV. It does create a CSV file, but none of the contents of the PDF are copied into it.
Here's the code:
from tabula import convert_into,read_pdf
import tabula
df = tabula.read_pdf("crimestory.pdf", spreadsheet=True,
                     pages='all', output_format="csv")
df.to_csv('crimestoryy.csv', index=False)
The output should be a CSV file containing the data. What I am getting is a blank CSV file.
Upvotes: 2
Views: 3749
Reputation: 11
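Tabula-py only extracts tables from a PDF document. If the PDF contains plain text and no tables, there is nothing for it to extract, which would explain the blank CSV.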
Upvotes: 1
Reputation: 212
I found an answer to my own question. To tackle this issue, I first converted the PDF file into a text file, and then converted that text file into a CSV file. Here's my code:
conversion.py
import csv
import os.path

import pdftotext

save_path = "/home/mayureshk/PycharmProjects/NLP/"
completeName_in = os.path.join(save_path, 'crimestory' + '.txt')
completeName_out = os.path.join(save_path, 'crimestoryycsv' + '.csv')

# Load your PDF; pdftotext.PDF yields one string per page.
with open("crimestory.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file, with pages separated by a blank line.
with open(completeName_in, 'w') as f:
    f.write("\n\n".join(pdf))

# Read the text file back as comma-delimited rows and write them out as CSV.
with open(completeName_in, newline='') as file1, \
        open(completeName_out, 'w', newline='') as file2:
    In_text = csv.reader(file1, delimiter=',')
    out_csv = csv.writer(file2)
    out_csv.writerows(In_text)
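Note that the text-to-CSV step splits every line on commas, so a sentence that contains commas is spread across several columns. If one row per line of text is wanted instead, a minimal sketch using only the standard library could look like the following. The names `text_to_rows` and `write_rows` and the single-column layout are assumptions made for illustration; they are not part of the original code or of pdftotext.

```python
import csv
import io


def text_to_rows(text):
    """Return one single-column row per non-empty line of text."""
    return [[line.strip()] for line in text.splitlines() if line.strip()]


def write_rows(rows, out_file):
    """Write rows as CSV to an already-open text file object."""
    csv.writer(out_file).writerows(rows)


# Usage with an in-memory buffer; a real file opened with newline=""
# can be passed to write_rows in the same way.
sample = "He said hello, then left.\n\nThe end."
buffer = io.StringIO()
write_rows(text_to_rows(sample), buffer)
csv_text = buffer.getvalue()
```

Because each line is kept as a single field, the csv module quotes any line that contains a comma rather than splitting it.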
Upvotes: 2
Reputation: 6143
Try this; hopefully it works:
import tabula
# convert PDF into CSV
tabula.convert_into("crimestory.pdf", "crimestory.csv", output_format="csv", pages='all')
or
df = tabula.read_pdf("crimestory.pdf", encoding='utf-8', spreadsheet=True, pages='all')
df.to_csv('crimestory.csv', encoding='utf-8')
or
from tabula import read_pdf
df = read_pdf("crimestory.pdf")
df
#make sure df displays your pdf contents in the output
from tabula import convert_into
convert_into("crimestory.pdf", "crimestory.csv", output_format="csv")
!cat crimestory.csv
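Outside a notebook, where the `!cat` shell escape is not available, the same sanity check can be done in plain Python. This is only a sketch: `csv_has_data` is a hypothetical helper name, not something provided by tabula, and it simply reports whether the file contains at least one non-empty cell.

```python
import csv


def csv_has_data(path):
    """Return True if the CSV file at path has at least one non-empty cell."""
    with open(path, newline="") as f:
        return any(cell.strip() for row in csv.reader(f) for cell in row)
```

Calling `csv_has_data("crimestory.csv")` after the conversion would return `False` for the blank file described in the question. A file that contains only a header row still counts as having data under this check.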
Upvotes: 1