DMM

Reputation: 11

How do I get Python to write a csv file from the output of my code?

I am incredibly new to python, so I might not have the right terminology...

I've extracted text from a pdf using pdfplumber. That's been saved as an object. The code I used for that is:

with pdfplumber.open('Bell_2014.pdf') as pdf:
    page = pdf.pages[0]
    bell = page.extract_text()
    print(bell)

So "bell" is all of the text from the first page of the imported PDF. I need to write all of that text as a string to a csv. I tried using:

with open('Bell_2014_ex.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(bell)

and

bell_ex = 'bell_2014_ex.csv'

with open(bell_ex, 'w', newline='') as csvfile:
   file_writer = csv.writer(csvfile,delimiter=',')
   file_writer.writerow(bell)

All I keep finding when I search for this is how to create a csv from specific hard-coded characters or numbers, but nothing about writing the output of code that has already run. For instance, I can get the above code:

bell_ex = 'bell_2014_ex.csv'

with open(bell_ex, 'w', newline='') as csvfile:
   file_writer = csv.writer(csvfile,delimiter=',')
   file_writer.writerow(['bell'])

to create a csv that has "bell" in one cell of the csv, but that's as close as I can get. I feel like this should be super easy, but I just can't seem to get it to work. Any thoughts? Please and thank you for helping my inexperienced self.

Upvotes: 0

Views: 609

Answers (3)

DMM

Reputation: 11

So my problem was that I was missing encoding='utf-8' for special characters, and my delimiter needed to be a space instead of a comma. What ended up working was:

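import csv

from pdfminer.high_level import extract_text

# extract_text returns the text of the whole PDF as a single string
text = extract_text('filepath.pdf')
print(text)

new_csv = 'filename.csv'

with open(new_csv, 'w', newline='', encoding='utf-8') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=' ')
    # Note: passing a string to writerow writes each character as its own field
    file_writer.writerow(text)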

However, since a lot of my pdfs weren't true pdfs but scans, the csv ended up having a lot of weird symbols. This worked for about half of the pdfs I have. If you have true pdfs, this will work great. If not, I'm currently trying to figure out how to extract all the text into a pandas dataframe separated by the headers within the pdfs, since pdfminer extracted all of the text perfectly. Thank you to everyone who helped!

Upvotes: 0

FisheyJay

Reputation: 450

Some similar code I wrote recently converts a tab-separated file to csv for insertion into a sqlite3 database. Maybe this is helpful:

    import csv
    import os

    # Convert tab-delimited listfile.txt to a comma-separated values (.csv) file
    out_file = os.path.join('input', 'listfile.csv')

    # newline='' lets the csv module handle line endings itself
    with open('listfile.txt', 'r', newline='') as in_text, \
            open(out_file, 'w', newline='') as out_csv:
        in_reader = csv.reader(in_text, delimiter='\t')
        out_writer = csv.writer(out_csv, dialect=csv.excel)

        # Copy each tab-separated row across as a comma-separated row
        for _line in in_reader:
            out_writer.writerow(_line)

... and that's it, not too tough

Upvotes: 0

Chase

Reputation: 3105

page.extract_text() is defined as: "Collates all of the page's character objects into a single string." which would make bell just a very long string.

The csv writer's writerow() expects an iterable, typically a list of strings, with each item in the list corresponding to a single column.

Your main issue is a type mismatch: you're trying to write a single string where a list of strings is expected. You will need to further operate on your bell object to convert it into a format acceptable to be written to a CSV.
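To make the mismatch concrete, here is a small sketch that writes the same text once as a bare string and once wrapped in a list. The helper name write_row_to_string is made up for this illustration, and it writes to an in-memory buffer rather than a real file:

```python
import csv
import io


def write_row_to_string(row):
    """Write one row with csv.writer and return the resulting CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    return buffer.getvalue()


# A bare string is iterated character by character: one column per character.
as_string = write_row_to_string("bell")

# Wrapping the string in a list gives a single column holding the whole text.
as_list = write_row_to_string(["bell"])

print(repr(as_string))
print(repr(as_list))
```

The first call spreads the letters across four columns, while the second keeps the whole string in one cell, which is the difference between writerow(bell) and writerow([bell]).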

Without knowing what bell contains or what you intend to write, I can't get any more specific, but the documentation for Python's csv module is very comprehensive in terms of setting delimiters, dialects, column definitions, etc. Once you have converted bell into a proper iterable of lists of strings, you can then write it to a CSV.
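As one possible conversion, and only a sketch since I don't know the layout you actually want, you could treat every non-empty line of the extracted text as its own single-column row. The function names and the sample text below are invented for the example; sample_text stands in for whatever page.extract_text() returned:

```python
import csv


def text_to_rows(text):
    """Split extracted text into one single-column row per non-empty line."""
    return [[line] for line in text.splitlines() if line.strip()]


def write_text_as_csv(text, csv_path):
    """Write each non-empty line of text as one row of a UTF-8 CSV file."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerows(text_to_rows(text))


# Stand-in for the string returned by page.extract_text()
sample_text = "Title of the paper\nFirst line, with a comma\n\nSecond line"
rows = text_to_rows(sample_text)
print(rows)
```

Calling write_text_as_csv(bell, 'Bell_2014_ex.csv') would then produce one row per line of the page. If you want the whole page in a single cell instead, writerow([bell]) does that, with the csv module quoting the embedded newlines for you.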

Upvotes: 1
