Reputation: 635

Extract specific data from .pdf and save in Excel file

Every month I need to extract some data from .pdf files to create an Excel table.

I'm able to convert the .pdf file to text, but I'm not sure how to extract and save the specific information I want. This is the code I have so far:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    fstr = ''
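    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    # Read the buffer once, after the loop: retstr accumulates every page,
    # so appending its value per page would repeat earlier pages.
    fstr = retstr.getvalue()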

    fp.close()
    device.close()
    retstr.close()
    return fstr

print convert_pdf_to_txt("FA20150518.pdf")

And this is the result:

    >>> 
AVILA 72, VALLDOREIX
08197 SANT CUGAT DEL VALLES
(BARCELONA)
TELF: 935441851
NIF: B65512725
EMAIL: [email protected]

JOSE LUIS MARTINEZ LOPEZ

AVDA. DEL ESLA, 33-D
24240 SANTA MARIA DEL PARAMO
LEON
TELF: 600871170

FECHA
17/06/15

FACTURA
  20150518

CLIENTE
43000335

N.I.F.

71548163 B

PÁG.

1

Nº VIAJE

RUTA

DESTINATARIO / REFERENCIA

KG

BULTOS

IMPORTE

2015064210-08/06/15

CERDANYOLA DEL VALLES -> VINAROS

FERRER ALIMENTACION - VINAROZ

2,000.0

1

         150,00

TOTAL IMP.

%

IMPORTE

BASE

         150,00

         150,00

%
 21,00

IVA

%

REC.

TOTAL FRA.

(€)

          31,50

         181,50

Eur

Forma Pago:
Banco:

CONTADO

Vencimientos:
17/06/15
181,50

OK, now I have the text returned by convert_pdf_to_txt.

I want to extract this information: customer name, bill number, price, expiration date, and payment method.

The customer name is always below "EMAIL: [email protected]".

The bill number is always below "FACTURA".

The price is always two lines below "Vencimientos:".

The expiration date is always on the line below "Vencimientos:".

The payment method is always below "Banco:".

My idea is to convert this text into a list and then do something like this:

Searching Customer:

 i = 0
 while i < len(listitem):
     if listitem[i] == "EMAIL: [email protected]":
         Customer = listitem[i + 1]  # the value is on the following line
         break                       # stop at the first match
     i += 1

Searching bill Number:

 i = 0
 while i < len(listitem):
     if listitem[i] == "FACTURA":
         BillNumber = listitem[i + 1]  # the value is on the following line
         break                         # stop at the first match
     i += 1

I don't know yet how to save the data in Excel, but I'm sure I can find examples on the forum. First I need to extract only this data.

Upvotes: 3

Answers (4)

Xavier Villafaina

Reputation: 635

Thanks for your help. I took code from the two examples you gave me, and now I can extract all the information I want.

# -*- coding: cp1252 -*-
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    fstr = ''
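    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    # Read the buffer once, after the loop: retstr accumulates every page,
    # so appending its value per page would repeat earlier pages.
    fstr = retstr.getvalue()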

    fp.close()
    device.close()
    retstr.close()
    return fstr


factura = "FA20150483.pdf"
#ejemplo 1

string = convert_pdf_to_txt(factura)
lines = list(filter(bool,string.split('\n')))
custData = {}
for i in range(len(lines)):
    if 'EMAIL:' in lines[i]:
        custData['Name'] = lines[i+1]
    elif 'FACTURA' in lines[i]:
        custData['BillNumber'] = lines[i+1]
    elif 'Vencimientos:' in lines[i]:
        custData['price'] = lines[i+2]
    elif 'Banco:' in lines[i]:
        custData['paymentType'] = lines[i+1]



#ejemplo 2
txtList = convert_pdf_to_txt(factura).splitlines()
nameIdx, billNumIdx, priceIdx, expirDateIdx, paymentIdx = -1, -1, -1, -1, -1

for idx, line in enumerate(txtList):
    if line == "EMAIL: [email protected]":
        nameIdx = idx +2 # in your example it should be +2...

    if line == "FACTURA":
        billNumIdx = idx + 1

    if "Vencimientos:" in line:
        priceIdx = idx + 2
        expirDateIdx = idx + 1

    if "Banco:" in line:
        paymentIdx = idx + 1

name = txtList[nameIdx] if nameIdx != -1 else ''
billNum = txtList[billNumIdx] if billNumIdx != -1 else ''
price = txtList[priceIdx] if priceIdx != -1 else ''
expirDate = txtList[expirDateIdx] if expirDateIdx != -1 else ''
payment = txtList[paymentIdx] if paymentIdx != -1 else ''


print expirDate

billNum = billNum.replace("Â Â ", "")


print billNum


custData['Name'] = custData['Name'].replace("Â", "")

print custData['Name']


custData['paymentType'] = custData['paymentType'].replace("Â", "")

print custData['paymentType']

print price

A few examples of the output:

    >>> 
25/06/15
20150480
BABY RACE S.L.
REMESA DIA 25 FECHA FACTURA
15,23
>>> ================================ RESTART ================================
>>> 
05/06/15
20150481
LOFT CUINA, S.L.
DIA 5 FECHA FACTURA
91,79
>>> ================================ RESTART ================================
>>> 
05/06/15
20150482
GRAFIQUES MOGENT S.L.
DIA 5 FECHA FACTURA
128,42
>>> ================================ RESTART ================================
>>> 
30/06/15
20150483
CHIEMIVALL SL
30 DIAS FECHA FACTURA
1.138,58
>>>
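The "Â" characters removed by the replace() calls above are most likely non-breaking spaces (U+00A0) whose UTF-8 bytes are being shown in cp1252; that is an assumption based on the output, not something I verified against the PDFs. Under that assumption, and assuming Python 3 with already-decoded text, a minimal sketch of a cleanup step could look like this. The names clean_field and parse_amount are hypothetical helpers, and parse_amount assumes amounts use the Spanish format seen above (dot for thousands, comma for decimals):

```python
from decimal import Decimal


def clean_field(text):
    """Replace non-breaking spaces with ordinary spaces and trim the ends."""
    return text.replace("\xa0", " ").strip()


def parse_amount(text):
    """Convert an amount such as '1.138,58' to a Decimal.

    Assumes '.' is the thousands separator and ',' the decimal separator.
    """
    normalized = clean_field(text).replace(".", "").replace(",", ".")
    return Decimal(normalized)


print(clean_field("\xa0\xa020150518"))  # 20150518
print(parse_amount("1.138,58"))         # 1138.58
```

Converting the price to a number before exporting avoids Excel treating it as text, though keeping the original string is also fine if the table is only for reading.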

Upvotes: 1

Riet

Reputation: 1290

You had the right idea. Split the text into lines, drop the empty ones, and read the line after each label:

string = convert_pdf_to_txt("FA20150518.pdf")
lines = list(filter(bool,string.split('\n')))
custData = {}
for i in range(len(lines)):
    if 'EMAIL:' in lines[i]:
        custData['Name'] = lines[i+1]
    elif 'FACTURA' in lines[i]:
        custData['BillNumber'] = lines[i+1]
    elif 'Vencimientos:' in lines[i]:
        custData['price'] = lines[i+2]
    elif 'Banco:' in lines[i]:
        custData['paymentType'] = lines[i+1]
print(custData)
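For the Excel part of the question, one simple option is to write the dictionaries to a CSV file, which Excel can open directly. This is only a sketch, assuming Python 3; the function name save_rows_to_csv and the output location are made up for illustration, the row values are taken from the sample invoice in the question, and the semicolon delimiter is an assumption that suits Excel installations using a comma as the decimal separator. Writing a real .xlsx file would need a third-party library instead.

```python
import csv
import os
import tempfile


def save_rows_to_csv(rows, path, fieldnames):
    """Write a list of dictionaries to a semicolon-delimited CSV file."""
    # utf-8-sig adds a byte order mark so Excel detects the encoding.
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter=";")
        writer.writeheader()
        writer.writerows(rows)


# One dictionary per invoice, shaped like custData above.
rows = [
    {
        "Name": "JOSE LUIS MARTINEZ LOPEZ",
        "BillNumber": "20150518",
        "price": "181,50",
        "paymentType": "CONTADO",
    },
]

# Hypothetical output location; any writable path works.
output_path = os.path.join(tempfile.gettempdir(), "invoices.csv")
save_rows_to_csv(rows, output_path, ["Name", "BillNumber", "price", "paymentType"])
```

Collecting one dictionary per PDF in a list and writing them all at once gives one row per invoice.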

Upvotes: 4

Michel

Reputation: 571

Try something like this:

txtList = convert_pdf_to_txt("FA20150518.pdf").splitlines()
nameIdx, billNumIdx, priceIdx, expirDateIdx, paymentIdx = -1, -1, -1, -1, -1

for idx, line in enumerate(txtList):
    if "EMAIL: [email protected]" in line:
        nameIdx = idx + 1 # in your example it should be +2...

    if "FACTURA" in line:
        billNumIdx = idx + 1

    if "Vencimientos:" in line:
        priceIdx = idx + 2
        expirDateIdx = idx + 1

    if "Banco:" in line:
        paymentIdx = idx + 1

name = txtList[nameIdx] if nameIdx != -1 else ''
billNum = txtList[billNumIdx] if billNumIdx != -1 else ''
price = txtList[priceIdx] if priceIdx != -1 else ''
expirDate = txtList[expirDateIdx] if expirDateIdx != -1 else ''
payment = txtList[paymentIdx] if paymentIdx != -1 else ''

If you are sure that the key lines contain only the label you are looking for ("FACTURA" and so on), you can replace the conditions with an exact comparison:

if line == "FACTURA":
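The difference matters when another line happens to contain the label. The sketch below mimics the loop above, where the last matching line wins, on a hypothetical shortened invoice in which the payment line also contains the word "FACTURA". The function index_after and the sample list are illustrative only:

```python
def index_after(lines, label, offset=1, exact=False):
    """Return the index `offset` lines after the last line matching `label`.

    Returns -1 when no line matches, like the loop above.
    """
    found = -1
    for idx, line in enumerate(lines):
        text = line.strip()
        matched = (text == label) if exact else (label in text)
        if matched:
            found = idx + offset
    return found


# Hypothetical, shortened layout for illustration.
lines = [
    "FACTURA",
    "  20150481",
    "Forma Pago:",
    "Banco:",
    "DIA 5 FECHA FACTURA",
    "Vencimientos:",
    "05/06/15",
    "91,79",
]

loose = index_after(lines, "FACTURA")               # substring match
strict = index_after(lines, "FACTURA", exact=True)  # exact match

print(lines[loose])           # Vencimientos:
print(lines[strict].strip())  # 20150481
```

With the substring test, the payment line overrides the real label and the wrong line is picked up as the bill number.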

Upvotes: 1

Mel

Reputation: 6065

Let's take a simpler example that I hope represents your issue.

You have a string stringPDF like this:

name1 \n
\n
value1 \n
name2 \n
value2 \n
\n
name3 \n
otherValue \n
value3 \n

A value is X lines after a name (in your example X is often 1, sometimes 2, but let's just say it can be any number). \n represents a line break (when you print the string, it prints on multiple lines).

First, we convert the string to a list of lines, by splitting where there are line breaks:

>>> stringList=stringPDF.split("\n")
>>> print(stringList)
['name1 ', '', 'value1 ', 'name2 ', 'value2 ', '', 'name3 ', 'otherValue ', 'value3 ', '']

Depending on your string, you may need to clean it. Here I have some extra whitespace at the end ('name1 ' instead of 'name1'). I use a list comprehension and strip() to remove it:

stringList=[line.strip() for line in stringList]

Once we have a proper list, we can define a simple function that returns a value, given the name and X (the number of lines between the name and the value):

def get_value(l,name,Xline):
    indexName=l.index(name)  #find the index of the name in the list
    indexValue=indexName+Xline # add X to this index
    return l[indexValue]  #get the value

>>> print(get_value(stringList, "name2", 1))
value2
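Note that list.index() raises a ValueError when the name is not in the list, and the lookup fails with an IndexError if the value would fall past the end. If some invoices may lack a field, a slightly more defensive variant could look like the following sketch; the default parameter is my addition, and the sample text is a fragment of the output shown in the question:

```python
def get_value(lines, name, offset, default=None):
    """Return the entry `offset` positions after `name`, or `default`."""
    try:
        index = lines.index(name)  # position of the label
    except ValueError:
        return default             # label not present
    target = index + offset
    return lines[target] if target < len(lines) else default


sample = "FECHA\n17/06/15\n\nFACTURA\n  20150518\n\nCLIENTE\n43000335\n"
lines = [line.strip() for line in sample.split("\n")]

print(get_value(lines, "FACTURA", 1))  # 20150518
print(get_value(lines, "CLIENTE", 1))  # 43000335
```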

Upvotes: 1
