Reputation: 1254
I am trying to read the text from a PDF file. This file is part of a generated report. I am able to read the text in the file, but it comes out very garbled. Eventually, I want each line in the PDF file as an item in a list, but you can see that the field names and entries get all mixed up. An example of the PDF I am trying to import can be found here, and below is the code that I am using to get the lines.
import PyPDF2
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO
filename = 'U:/PLAN/BCUBRICH/Python/Network Plan/Page 1 from AMP380_1741500.pdf'
def getPDFContent(filename):
    content = ""
    # Open in binary mode; the with-block closes the file when done
    with open(filename, "rb") as p:
        pdf = PyPDF2.PdfFileReader(p)
        num_pages = pdf.getNumPages()
        # Append the extracted text of every page, one page per line break
        for i in range(0, num_pages):
            content += pdf.getPage(i).extractText() + '\n'
    # content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
content = getPDFContent(filename)
Here is the output I get:
Out:'''UNITED STATES ENVIRONMENTAL PROTECTION AGENCYAIR QUALITY SYSTEMSITE DESCRIPTION REPORTApr. 25, 2019Site ID: 49-003-0003
Site Name: Brigham City
Local ID: BR
140 W.FISHBURN DRIVE, BRIGHAM CITY, UTStreet Address: City: Brigham City
Utah Zip Code: 84302
State: Box ElderCounty: Monitoring PointLocation Description: SuburbanLocation Setting: Interpolation-MapColl. Method:ResidentialLand Use: 20000819Date Established: Date Terminated: 20190130Last Updated: HQ Eval. Date:Regional Eval. Date: UtahAQCR : Ogden-Clearfield, UTCBSA: Salt Lake City-Provo-Orem, UTCSA: Met. Site ID:Direct Met Site: On-Site Met EquipType Met Site: Dist to Met. Site(m): Local Region: Urban Area: Not in an urban area
EPA Region: Denver
17411City Population: Dir. to CBD: Dist. to City(km): 3000Census Block: 3Block Group: 960701Census Tract: 1Congressional District: Class 1 Area: +41.492707Site Latitude: -112.018863Site Longitude: MountainTime Zone: UTM Zone: UTM Northing: UTM Easting: Accuracy: 60.73
Datum: WGS84
Scale: 24000
Point/Line/Area: Point 1,334.0Vertical Measure(m): 0Vert Accuracy: UnknownVert Datum : Vert Method: Unknown
Owning Agency: 1113 Utah Department Of Environmental Quality SITE COMMENTS SITE FOR OZONE, PM2.5, AND MET ACTIVE MONITOR TYPES Primary Monitor Periods # of Parameter Code Poc Begin Date End Date Monitor Type Monitors 42602 1 20180126 OTHER 2 44201 1 20010501 SLAMS 16 88101 1 20000819 20141231 88101 1 20160101 20161231 88101 1 20180101 88101 3 20170101 20171231 88101 4 20150101 20151231 TANGENT ROADS Road Traffic Traffic Compass Number Road Name Count Year Traffic Volume Source Road Type Sector 1 FISHBURN DRIVE 450 2000 LOCAL ST OR HY S Page 1 of 77
'''
For example, I would like the eighth item in the list to be
State: Utah Zip Code: 84302 County: Box Elder
but what I get is
Utah Zip Code: 84302 State: Box ElderCounty:
These kinds of mix-ups happen throughout the document.
Upvotes: 1
Views: 759
Reputation: 95918
This is merely an explanation of why that happens, not a solution. But it is too long for a comment, so it got an answer...
The reason for this weird order is that the text chunks in the document are drawn in that order.
If you dig into the PDF and look at the content stream, you find this segment responsible for the example line you picked:
/TD <</MCID 12 >>BDC
-47.25 -1.685 Td
(Utah )Tj
28.125 0 Td
[(Zip Code: )-190(84302)]TJ
-32.06 -0 Td
(State: )Tj
EMC
/TD <</MCID 13 >>BDC
56.81 0 Td
(Box Elder)Tj
-5.625 0 Td
(County: )Tj
EMC
You probably don't understand the instructions, but you can see that the strings (in round brackets, (...)) come exactly in the order you observe in the output
Utah Zip Code: 84302 State: Box ElderCounty:
instead of the desired
State: Utah Zip Code: 84302 County: Box Elder
The Td instructions in between make the text insertion point jump back and forth, which is how the strings end up in the expected visual order in a viewer even though they are drawn out of order.
Apparently your text extraction method merely retrieves the strings from the content stream in the order they are drawn and ignores the actual locations at which they are drawn. For a proper text extraction, therefore, you have to change the method you use. As I don't really know PyPDF2 myself, I cannot say whether this library offers different text extraction methods to turn to or whether you have to resort to a different library.
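To make the idea concrete, here is a minimal sketch in plain Python of what position-aware extraction does with the segment above. It is not PyPDF2 code. The list `ops` is a hand-transcribed, simplified version of the content stream (the TJ array is collapsed into one string and its kerning value dropped), the helper names `chunks_with_positions` and `reading_order_line` are made up for illustration, and the starting position (0, 0) is an assumption because the preceding part of the stream is not shown. The sketch only relies on the fact that each Td offset is relative to the start of the current line.

```python
# Hand-transcribed, simplified version of the content stream segment above.
# ("Td", dx, dy) moves the start of the text line; ("Tj", text) draws a string.
ops = [
    ("Td", -47.25, -1.685),
    ("Tj", "Utah "),
    ("Td", 28.125, 0),
    ("Tj", "Zip Code: 84302"),  # originally a TJ array with a kerning adjustment
    ("Td", -32.06, 0),
    ("Tj", "State: "),
    ("Td", 56.81, 0),
    ("Tj", "Box Elder"),
    ("Td", -5.625, 0),
    ("Tj", "County: "),
]


def chunks_with_positions(ops, start=(0.0, 0.0)):
    """Attach to every drawn string the position at which it is drawn.

    The start position is an assumption; only relative offsets matter here.
    """
    x, y = start
    placed = []
    for op in ops:
        if op[0] == "Td":
            # Td offsets accumulate relative to the previous line start
            x += op[1]
            y += op[2]
        elif op[0] == "Tj":
            placed.append((x, y, op[1]))
    return placed


def reading_order_line(placed):
    """Sort chunks top to bottom (larger y first), then left to right."""
    ordered = sorted(placed, key=lambda c: (-round(c[1], 3), c[0]))
    return " ".join(text.strip() for _, _, text in ordered)


placed = chunks_with_positions(ops)
draw_order = " ".join(text.strip() for _, _, text in placed)
print(draw_order)                  # order in which the strings are drawn
print(reading_order_line(placed))  # order in which they appear on the page
```

Sorting by the accumulated position puts State: before Utah and County: before Box Elder, which is the order you asked for. A real extractor additionally has to account for the text matrix, font sizes, glyph widths and other operators, so treat this only as an illustration of why positions must be taken into account.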
Upvotes: 2