Reputation: 11
I am incredibly new to python, so I might not have the right terminology...
I've extracted text from a PDF using pdfplumber and saved it as an object. The code I used for that is:
with pdfplumber.open('Bell_2014.pdf') as pdf:
    page = pdf.pages[0]
    bell = page.extract_text()
print(bell)
So "bell" is all of the text from the first page of the imported PDF. [image: what bell looks like] I need to write all of that text as a string to a CSV. I tried using:
with open('Bell_2014_ex.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(bell)
and
bell_ex = 'bell_2014_ex.csv'
with open(bell_ex, 'w', newline='') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=',')
    file_writer.writerow(bell)
All I keep finding when I search for this is how to create a CSV with specific characters or numbers, but nothing about writing the output of code that has already been executed. For instance, I can get this variation of the above code:
bell_ex = 'bell_2014_ex.csv'
with open(bell_ex, 'w', newline='') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=',')
    file_writer.writerow(['bell'])
to create a csv that has "bell" in one cell of the csv, but that's as close as I can get. I feel like this should be super easy, but I just can't seem to get it to work. Any thoughts? Please and thank you for helping my inexperienced self.
Upvotes: 0
Views: 609
Reputation: 11
So my problem was that I was missing the "encoding = 'utf-8'" for special characters and my delimiter needed to be a space instead of a comma. What ended up working was:
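import csv
from pdfminer.high_level import extract_text

text = extract_text('filepath.pdf')  # renamed from "object", which shadows the built-in
print(text)
new_csv = 'filename.csv'
with open(new_csv, 'w', newline='', encoding='utf-8') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=' ')
    # writerow() iterates over the string, so each character becomes its own field
    file_writer.writerow(text)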
However, since a lot of my PDFs weren't true PDFs but scans, the CSV ended up having a lot of weird symbols. This worked for about half of the PDFs I have. If you have true PDFs, this will work great. If not, I'm currently trying to figure out how to extract all the text into a pandas DataFrame separated by the headers within the PDFs, since pdfminer extracted all the text perfectly. Thank you to everyone who helped!
Upvotes: 0
Reputation: 450
Some similar code I wrote recently converts a tab-separated file to CSV for insertion into a sqlite3 database. Maybe this is helpful:
import csv
import os

# Convert tab-delimited listfile.txt to a comma-separated values (.csv) file
in_file = 'listfile.txt'
out_file = os.path.join('input', 'listfile.csv')

# The with statement closes both files, even if an error occurs
with open(in_file, 'r', newline='') as in_text, \
        open(out_file, 'w', newline='') as out_csv:
    in_reader = csv.reader(in_text, delimiter='\t')
    out_writer = csv.writer(out_csv, dialect=csv.excel)
    for _line in in_reader:
        out_writer.writerow(_line)
... and that's it, not too tough
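For the "insertion into a sqlite3 database" step, here is a rough sketch of how the converted rows could be loaded. The table name, the two-column layout, and the sample rows are assumptions made up for illustration, and it uses an in-memory database and an in-memory CSV so that it runs without any files:

```python
import csv
import io
import sqlite3

# Stand-in for the converted listfile.csv; the two-column layout is assumed.
csv_text = "1,alpha\r\n2,beta\r\n"

conn = sqlite3.connect(":memory:")  # in-memory database, just for the sketch
conn.execute("CREATE TABLE items (id INTEGER, name TEXT)")  # hypothetical schema

# csv.reader yields one list per row, which executemany() accepts as parameters.
reader = csv.reader(io.StringIO(csv_text, newline=""))
conn.executemany("INSERT INTO items (id, name) VALUES (?, ?)", reader)
conn.commit()

loaded = conn.execute("SELECT id, name FROM items ORDER BY id").fetchall()
print(loaded)
conn.close()
```

With a real file you would pass the opened CSV file to csv.reader instead of the io.StringIO object.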
Upvotes: 0
Reputation: 3105
page.extract_text() is defined as: "Collates all of the page's character objects into a single string." which would make bell just a very long string.
The CSV writerow() expects by default a list of strings, with each item in the list corresponding to a single column. A bare string is itself iterable, so passing one makes each character its own column.
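To make that concrete, here is a small sketch comparing the two calls. The value "abc" is only a stand-in for the real extracted text, and io.StringIO is used in place of a file so nothing is written to disk:

```python
import csv
import io

# Stand-in for the extracted page text (not the real PDF contents).
bell = "abc"

# Passing the string directly: csv iterates over it character by character.
buf_string = io.StringIO()
csv.writer(buf_string).writerow(bell)
as_string = buf_string.getvalue()

# Passing a one-item list: the whole string lands in a single column.
buf_list = io.StringIO()
csv.writer(buf_list).writerow([bell])
as_list = buf_list.getvalue()

print(repr(as_string))  # 'a,b,c\r\n'
print(repr(as_list))    # 'abc\r\n'
```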
Your main issue is a type mismatch: you're trying to write a single string where a list of strings is expected. You will need to operate further on your bell object to convert it into a format that can be written to a CSV.
Without knowing what bell contains or what you intend to write, I can't get any more specific, but the documentation for Python's csv module is very comprehensive on setting delimiters, dialects, column definitions, etc. Once you have converted bell into a proper iterable of lists of strings (for writerows(), or a single list of strings for writerow()), you can then write it to a CSV.
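As one possible conversion, and only a sketch since I don't know your data: split the text on line breaks and write each line as its own single-column row. The sample text below is invented, and io.StringIO stands in for the output file so the example is self-contained:

```python
import csv
import io

# Stand-in for page.extract_text() output; the real text is not known here.
bell = "Title line\nSecond line, with a comma\nThird line"

# One row per line of text, each row a one-item list (a single column).
rows = [[line] for line in bell.splitlines()]

# io.StringIO keeps the sketch self-contained; for a real file use
# open('Bell_2014_ex.csv', 'w', newline='', encoding='utf-8') instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Lines that contain a comma are quoted automatically by the writer, so they still end up in one cell. If you want several columns per row instead, you would have to split each line further (for example on runs of spaces), which depends on how the PDF is laid out.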
Upvotes: 1