Reputation: 635
Every month I need to extract some data from .pdf files to create an Excel table.
I'm able to convert the .pdf file to text, but I'm not sure how to extract and save the specific information I want. This is the code I have so far:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    # retstr accumulates the text of every page, so read it once after the loop
    fstr = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return fstr

print convert_pdf_to_txt("FA20150518.pdf")
And this is the result:
>>>
AVILA 72, VALLDOREIX
08197 SANT CUGAT DEL VALLES
(BARCELONA)
TELF: 935441851
NIF: B65512725
EMAIL: [email protected]
JOSE LUIS MARTINEZ LOPEZ
AVDA. DEL ESLA, 33-D
24240 SANTA MARIA DEL PARAMO
LEON
TELF: 600871170
FECHA
17/06/15
FACTURA
20150518
CLIENTE
43000335
N.I.F.
71548163 B
PÁG.
1
Nº VIAJE
RUTA
DESTINATARIO / REFERENCIA
KG
BULTOS
IMPORTE
2015064210-08/06/15
CERDANYOLA DEL VALLES -> VINAROS
FERRER ALIMENTACION - VINAROZ
2,000.0
1
150,00
TOTAL IMP.
%
IMPORTE
BASE
150,00
150,00
%
21,00
IVA
%
REC.
TOTAL FRA.
(€)
31,50
181,50
Eur
Forma Pago:
Banco:
CONTADO
Vencimientos:
17/06/15
181,50
OK, now I have the text returned by convert_pdf_to_txt.
I want to extract this information: customer, bill number, price, expiration date, and payment method.
The customer name is always on the line below "EMAIL: [email protected]".
The bill number is always on the line below "FACTURA".
The price is always two lines below "Vencimientos:".
The expiration date is always on the line below "Vencimientos:".
The payment method is always on the line below "Banco:".
I'm thinking of something like this, if I can convert the text into a list:
Searching Customer:
i = 0
while i < len(listitem):
    if listitem[i] == "EMAIL: [email protected]":
        Customer = listitem[i + 1]  # the value is on the next line
        break
    i += 1
Searching bill Number:
i = 0
while i < len(listitem):
    if listitem[i] == "FACTURA":
        BillNumber = listitem[i + 1]  # the value is on the next line
        break
    i += 1
I don't know yet how to save the data to Excel, but I'm sure I can find examples in the forum. First I need to extract only this data.
Upvotes: 3
Views: 27031
Reputation: 635
Thanks for your help. I took code from the two examples you gave me, and now I can extract all the information I want.
# -*- coding: cp1252 -*-
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    # retstr accumulates the text of every page, so read it once after the loop
    fstr = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return fstr

factura = "FA20150483.pdf"
# example 1
string = convert_pdf_to_txt(factura)
lines = list(filter(bool,string.split('\n')))
custData = {}
for i in range(len(lines)):
    if 'EMAIL:' in lines[i]:
        custData['Name'] = lines[i+1]
    elif 'FACTURA' in lines[i]:
        custData['BillNumber'] = lines[i+1]
    elif 'Vencimientos:' in lines[i]:
        custData['price'] = lines[i+2]
    elif 'Banco:' in lines[i]:
        custData['paymentType'] = lines[i+1]
# example 2
txtList = convert_pdf_to_txt(factura).splitlines()
nameIdx, billNumIdx, priceIdx, expirDateIdx, paymentIdx = -1, -1, -1, -1, -1
for idx, line in enumerate(txtList):
    if line == "EMAIL: [email protected]":
        nameIdx = idx +2 # in your example it should be +2...
    if line == "FACTURA":
        billNumIdx = idx + 1
    if "Vencimientos:" in line:
        priceIdx = idx + 2
        expirDateIdx = idx + 1
    if "Banco:" in line:
        paymentIdx = idx + 1
name = txtList[nameIdx] if nameIdx != -1 else ''
billNum = txtList[billNumIdx] if billNumIdx != -1 else ''
price = txtList[priceIdx] if priceIdx != -1 else ''
expirDate = txtList[expirDateIdx] if expirDateIdx != -1 else ''
payment = txtList[paymentIdx] if paymentIdx != -1 else ''
print expirDate
billNum = billNum.replace("Â Â ", "")
print billNum
custData['Name'] = custData['Name'].replace("Â", "")
print custData['Name']
custData['paymentType'] = custData['paymentType'].replace("Â", "")
print custData['paymentType']
print price
A few examples of the output:
>>>
25/06/15
20150480
BABY RACE S.L.
REMESA DIA 25 FECHA FACTURA
15,23
>>> ================================ RESTART ================================
>>>
05/06/15
20150481
LOFT CUINA, S.L.
DIA 5 FECHA FACTURA
91,79
>>> ================================ RESTART ================================
>>>
05/06/15
20150482
GRAFIQUES MOGENT S.L.
DIA 5 FECHA FACTURA
128,42
>>> ================================ RESTART ================================
>>>
30/06/15
20150483
CHIEMIVALL SL
30 DIAS FECHA FACTURA
1.138,58
>>>
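Since the goal is an Excel table, here is a minimal sketch of one possible next step. It is written for Python 3 (the code above is Python 2) and uses only the standard library. The helper name parse_amount, the column names, and the semicolon delimiter are assumptions made for illustration. The rows are copied from the output above, and an in-memory buffer stands in for a real .csv file, which Excel can open.

```python
import csv
import io
from decimal import Decimal

def parse_amount(text):
    """Convert an amount such as '1.138,58' (dot for thousands, comma for
    decimals, as in the invoices above) into Decimal('1138.58')."""
    return Decimal(text.strip().replace(".", "").replace(",", "."))

# Rows copied from the output shown above:
# (expiration date, bill number, customer, payment type, price)
rows = [
    ("25/06/15", "20150480", "BABY RACE S.L.", "REMESA DIA 25 FECHA FACTURA", "15,23"),
    ("30/06/15", "20150483", "CHIEMIVALL SL", "30 DIAS FECHA FACTURA", "1.138,58"),
]

# The buffer stands in for open("facturas.csv", "w", newline="");
# the file name is hypothetical.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";")  # assumed delimiter
writer.writerow(["ExpirationDate", "BillNumber", "Customer", "PaymentType", "Price"])
for expir_date, bill_num, name, payment, price in rows:
    writer.writerow([expir_date, bill_num, name, payment, parse_amount(price)])

# Numeric amounts can now be summed reliably
total = sum(parse_amount(row[4]) for row in rows)

print(buffer.getvalue())
print(total)
```

Decimal is used instead of float so that the invoice amounts are not affected by binary rounding.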
Upvotes: 1
Reputation: 1290
You had the right idea:
string = convert_pdf_to_txt("FA20150518.pdf")
lines = list(filter(bool,string.split('\n')))
custData = {}
for i in range(len(lines)):
    if 'EMAIL:' in lines[i]:
        custData['Name'] = lines[i+1]
    elif 'FACTURA' in lines[i]:
        custData['BillNumber'] = lines[i+1]
    elif 'Vencimientos:' in lines[i]:
        custData['price'] = lines[i+2]
    elif 'Banco:' in lines[i]:
        custData['paymentType'] = lines[i+1]
print(custData)
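The same idea can also be driven by a small table of rules, which keeps the label, the field name, and the line offset in one place and guards against reading past the end of the list. This is only a sketch for Python 3: FIELD_RULES and extract_fields are hypothetical names, the offsets follow the description in the question, and the sample text is a shortened excerpt of the output shown there.

```python
# Hypothetical rule table: label line -> list of (field name, lines below the label)
FIELD_RULES = {
    "FACTURA": [("BillNumber", 1)],
    "Vencimientos:": [("ExpirationDate", 1), ("price", 2)],
    "Banco:": [("paymentType", 1)],
}

def extract_fields(text, rules=FIELD_RULES):
    """Return a dict of fields found by looking a fixed number of lines
    below each label. Blank lines are dropped before counting."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    data = {}
    for i, line in enumerate(lines):
        for field, offset in rules.get(line, []):
            # Keep the first match only, and never index past the end
            if field not in data and i + offset < len(lines):
                data[field] = lines[i + offset]
    return data

# Shortened excerpt of the text shown in the question
sample = "FACTURA\n\n20150518\nForma Pago:\nBanco:\nCONTADO\nVencimientos:\n17/06/15\n181,50\n"
print(extract_fields(sample))
```

Because the labels are dictionary keys, they are matched exactly rather than as substrings.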
Upvotes: 4
Reputation: 571
Try something like this:
txtList = convert_pdf_to_txt("FA20150518.pdf").splitlines()
nameIdx, billNumIdx, priceIdx, expirDateIdx, paymentIdx = -1, -1, -1, -1, -1
for idx, line in enumerate(txtList):
    if "EMAIL: [email protected]" in line:
        nameIdx = idx + 1 # in your example it should be +2...
    if "FACTURA" in line:
        billNumIdx = idx + 1
    if "Vencimientos:" in line:
        priceIdx = idx + 2
        expirDateIdx = idx + 1
    if "Banco:" in line:
        paymentIdx = idx + 1
name = txtList[nameIdx] if nameIdx != -1 else ''
billNum = txtList[billNumIdx] if billNumIdx != -1 else ''
price = txtList[priceIdx] if priceIdx != -1 else ''
expirDate = txtList[expirDateIdx] if expirDateIdx != -1 else ''
payment = txtList[paymentIdx] if paymentIdx != -1 else ''
If you are sure that the key lines contain only what you are looking for ("FACTURA" and so on), you can replace the conditions with an exact comparison:
if line == "FACTURA":
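To illustrate why the exact comparison can matter, here is a small Python 3 sketch. The list is made up for the illustration, but its payment line is taken from the asker's own output in this thread, where the payment terms also contain the word "FACTURA".

```python
# Made-up excerpt: the label, its value, and a payment line that
# happens to contain the same word.
lines = [
    "FACTURA",
    "20150480",
    "Banco:",
    "REMESA DIA 25 FECHA FACTURA",
    "Vencimientos:",
]

# Substring test: matches the label and the payment line
substring_hits = [i for i, line in enumerate(lines) if "FACTURA" in line]

# Exact test: matches only the label line
exact_hits = [i for i, line in enumerate(lines) if line.strip() == "FACTURA"]

print(substring_hits)  # [0, 3]
print(exact_hits)      # [0]
```

With the substring test, the later match would overwrite the bill number index with a line that is not the label.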
Upvotes: 1
Reputation: 6065
Let's take a simpler example that I hope represents your issue.
You have a string stringPDF like this:
name1 \n
\n
value1 \n
name2 \n
value2 \n
\n
name3 \n
otherValue \n
value3 \n
A value is X lines after a name (in your example X is often 1, sometimes 2, but let's just say it can be any number). \n represents the line breaks (when you print the string, it prints on multiple lines).
First, we convert the string to a list of lines, by splitting where there are line breaks:
>>> stringList=stringPDF.split("\n")
>>> print(stringList)
['name1 ', '', 'value1 ', 'name2 ', 'value2 ', '', 'name3 ', 'otherValue ', 'value3 ', '']
Depending on your string, you may need to clean it. Here I have some extra whitespace at the end ('name1 ' instead of 'name1'). I use a list comprehension and strip() to remove it:
stringList=[line.strip() for line in stringList]
Once we have a proper list, we can define a simple function that returns a value, given the name and X (the number of lines from the name to the value):
def get_value(l,name,Xline):
    indexName=l.index(name) #find the index of the name in the list
    indexValue=indexName+Xline # add X to this index
    return l[indexValue] #get the value
>>> print(get_value(stringList,"name2",1))
value2
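One thing to keep in mind is that list.index raises ValueError when the name is not in the list, and the computed index can fall outside the list. A possible variant that returns a default instead is sketched below for Python 3. The name get_value_or_default and the default parameter are assumptions for illustration, not part of the function above.

```python
def get_value_or_default(lines, name, offset, default=""):
    """Return the entry `offset` positions after `name`,
    or `default` if the name is missing or the index is out of range."""
    try:
        index_value = lines.index(name) + offset
    except ValueError:  # list.index raises ValueError when name is absent
        return default
    if not 0 <= index_value < len(lines):
        return default
    return lines[index_value]

# Same sample string as above, cleaned with strip()
string_pdf = "name1 \n\nvalue1 \nname2 \nvalue2 \n\nname3 \notherValue \nvalue3 \n"
string_list = [line.strip() for line in string_pdf.split("\n")]

print(get_value_or_default(string_list, "name1", 2))         # value1
print(get_value_or_default(string_list, "name3", 2))         # value3
print(get_value_or_default(string_list, "missing", 1, "?"))  # ?
```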
Upvotes: 1