How to extract text from pdf line by line in python 2.7

Question

I'm trying to read and parse a PDF file containing a table...

This is the table in the PDF:

and this is my code:

import PyPDF2
import re

# Raw string so the backslashes in the Windows path are taken literally.
FileRead = open(r"C:\Users\Zahraa Jawad\S40rooms.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(FileRead)
for page in pdfReader.pages:
    # extractText() returns the text of the whole page as a single string.
    print page.extractText()

What I want is to read each line of the table separately (split it) and save all the information in the line (YEAR, SEMESTER, ROOM, DAY, COURSE NO, INSTRUCTOR, TIME FROM, TIME TO, NUMBER OF STUDENTS) in an array. After each ' ', I'd like to save the data in a new index in the array.

However, my code does not work; it reads all the information and returns it as a paragraph! I don't know how to split each line.

For example (see the PDF above):

341 458 01 Gazwa Sleebekh UTH 09:00 09:50 30

Output: YEAR, SEMESTER, ROOM, DAY, COURSE NO, INSTRUCTOR, TIME FROM, TIME TO, NUMBER OF STUDENTS

2015/2016, Second, S40-021, U, 341, Ghazwa Sleebekh, 09:00, 09:50, 30
2015/2016, Second, S40-021, T, 341, Ghazwa Sleebekh, 09:00, 09:50, 30
2015/2016, Second, S40-021, H, 341, Ghazwa Sleebekh, 09:00, 09:50, 30

Each line is split into one row per day in the UTH (DAY) column, but my problem is how to read each line in the PDF and search within it using a regular expression :)

Roland Smith · Accepted Answer

When converting PDF to text, I've had the best results using pdftotext from the poppler utilities. (You can find MS Windows binaries in several places [1], [2].)

import subprocess

def pdftotext(pdf, page=None):
    """Retrieve all text from a PDF file.

    Arguments:
        pdf: Path of the file to read.
        page: Number of the page to read. If None, read all the pages.

    Returns:
        A list of lines of text.
    """
    if page is None:
        args = ['pdftotext', '-layout', '-q', pdf, '-']
    else:
        args = ['pdftotext', '-f', str(page), '-l', str(page), '-layout',
                '-q', pdf, '-']
    try:
        txt = subprocess.check_output(args, universal_newlines=True)
        lines = txt.splitlines()
    except subprocess.CalledProcessError:
        lines = []
    return lines

Note that text extraction only works if the PDF file actually contains text! Some PDF files only contain scanned images of text, in which case you'll need an OCR solution.
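Once you have the list of lines, each table row can be matched with a regular expression and expanded into one record per day letter. The following is only a rough sketch based on the single sample row in the question. The names ROW_RE and parse_row are made up for illustration, and the meaning of the second and third numeric columns (458 and 01) is a guess, since the question does not say what they are. YEAR, SEMESTER and ROOM are not part of the row itself, so they are left out here and would have to be picked up from elsewhere on the page.

```python
import re

# Pattern for one table row, modelled on the sample line from the question:
#   341 458 01 Gazwa Sleebekh UTH 09:00 09:50 30
# Assumption: the row is three numeric columns, a name, a block of
# upper-case day letters, two HH:MM times and a student count.
ROW_RE = re.compile(
    r'^\s*(?P<course>\d+)\s+(?P<code>\d+)\s+(?P<section>\d+)\s+'
    r'(?P<instructor>.+?)\s+(?P<days>[A-Z]+)\s+'
    r'(?P<time_from>\d{2}:\d{2})\s+(?P<time_to>\d{2}:\d{2})\s+'
    r'(?P<students>\d+)\s*$'
)


def parse_row(line):
    """Return one record per day letter, or an empty list if the line
    does not look like a table row."""
    m = ROW_RE.match(line)
    if m is None:
        return []
    f = m.groupdict()
    # Each letter in the days column (e.g. 'UTH') becomes its own record.
    return [
        [day, f['course'], f['instructor'],
         f['time_from'], f['time_to'], f['students']]
        for day in f['days']
    ]
```

Lines that do not match (headers, blank lines, page footers) give an empty list, so you can loop over everything returned by pdftotext() and simply extend a result list with parse_row(line). If the real table has rows that differ from the sample, the pattern will need adjusting.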
