Reputation: 53
I'm trying to read and parse a PDF file containing a table...
This is the table in the PDF:
and this is my code:
import PyPDF2
import re
from PyPDF2 import PdfFileReader , PdfFileWriter
FileRead = open("C:\\Users\\Zahraa Jawad\\S40rooms.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(FileRead)
pdfwriter = PdfFileWriter()
for page in pdfReader.pages:
    print(page.extractText())
What I want is to read each line of the table separately and save all the information in that line (YEAR, SEMESTER, ROOM, DAY, COURSE NO, INSTRUCTOR, TIME FROM, TIME TO, NUMBER OF STUDENTS) in an array. After each '\n', I'd like to save the data at a new index in the array.
However, my code does not work; it reads all the information and returns it as a paragraph! I don't know how to split each line.
For example ( See the PDF above ):
341 458 01 Gazwa Sleebekh UTH 09:00 09:50 30
Output: YEAR, SEMESTER, ROOM, DAY, COURSE NO, INSTRUCTOR, TIME FROM, TIME TO, NUMBER OF STUDENTS
2015/2016, Second, S40-021, U, 341, Ghazwa Sleebekh, 09:00, 09:50, 30
2015/2016, Second, S40-021, T, 341, Ghazwa Sleebekh, 09:00, 09:50, 30
2015/2016, Second, S40-021, H, 341, Ghazwa Sleebekh, 09:00, 09:50, 30
The row is split into one record per letter in UTH (the DAY column), but my problem is how to read each line of the PDF and search within it using a regular expression :)
Upvotes: 1
Views: 4867
Reputation: 43495
When converting PDF to text, I've had the best results with pdftotext
from the Poppler utilities. (MS Windows binaries are available in several places [1], [2].)
import subprocess
def pdftotext(pdf, page=None):
    """Retrieve all text from a PDF file.

    Arguments:
        pdf: Path of the file to read.
        page: Number of the page to read. If None, read all the pages.

    Returns:
        A list of lines of text.
    """
    if page is None:
        args = ['pdftotext', '-layout', '-q', pdf, '-']
    else:
        args = ['pdftotext', '-f', str(page), '-l', str(page), '-layout',
                '-q', pdf, '-']
    try:
        txt = subprocess.check_output(args, universal_newlines=True)
        lines = txt.splitlines()
    except subprocess.CalledProcessError:
        lines = []
    return lines
Note that text extraction only works if the PDF file actually contains text! Some PDF files only contain scanned images of text, in which case you'll need an OCR solution.
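Once you have the list of lines from pdftotext() above, each table row could be matched with a regular expression and split into one record per day letter. The following is only a rough sketch based on the single example row in the question: the name parse_row, the pattern ROW, and the column layout (course number, two further numeric columns, instructor, day letters, two times, student count) are my assumptions, and the year, semester and room are passed in by hand because the question does not show where they appear in the PDF. The real pdftotext output may need a different pattern.

```python
import re

# Assumed row layout, taken from the one example line in the question:
#   <course no> <number> <number> <instructor> <day letters> <from> <to> <students>
ROW = re.compile(
    r"^\s*(?P<course>\d+)\s+\d+\s+\d+\s+"      # course no, then two numeric columns of unknown meaning
    r"(?P<instructor>.+?)\s+"                  # instructor name (may contain spaces)
    r"(?P<days>[A-Z]+)\s+"                     # day letters, e.g. UTH
    r"(?P<start>\d{2}:\d{2})\s+(?P<end>\d{2}:\d{2})\s+"
    r"(?P<students>\d+)\s*$"
)


def parse_row(line, year, semester, room):
    """Return one record per day letter, or an empty list if the line is not a table row."""
    m = ROW.match(line)
    if m is None:
        return []
    return [
        (year, semester, room, day, m.group("course"), m.group("instructor"),
         m.group("start"), m.group("end"), int(m.group("students")))
        for day in m.group("days")
    ]


# Example with the row from the question; year, semester and room are supplied manually.
line = "341 458 01 Gazwa Sleebekh UTH 09:00 09:50 30"
for record in parse_row(line, "2015/2016", "Second", "S40-021"):
    print(", ".join(str(field) for field in record))
```

Lines that do not look like a table row (headers, blank lines) simply return an empty list, so you can run every line of the pdftotext output through parse_row and collect the results in one list.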
Upvotes: 4