Reputation: 17017
I'm trying to use the PDF document parser called Textricator. It can parse a PDF with any of three common PDF libraries (itext5, itext7, pdfbox) and offers three methods: text, table and form. The text method does plain raw text extraction, the table method reads structured table data, and the form method parses less structured forms using a finite state machine (FSM).
However, I am not able to use the form parser. Perhaps I simply don't understand how to organize the many configuration states. The documentation is lacking a simple form example, and someone recently posted an attempt to read a very basic table using the form method, but was not able to. I also gave it a shot, but without any success.
Q: Can someone help me configure the state machine in the YML file?
(This is used to parse the demo file from one of that repo's issues, and shown in the copied screenshot below.)
The YML configuration file.
extractor: "pdf.pdfbox"
header:
  default: 100
footer:
  default: 600
maxRowDistance: 2
rootRecordType: item
recordTypes:
  item:
    label: "item"
    valueTypes:
      - item
      - date
      - description
      - order_number
      - quantity
      - price
valueTypes:
  item:
    label: "Item"
  date:
    label: "Date"
  description:
    label: "Description"
  order_number:
    label: "OrderNo"
  quantity:
    label: "Qty"
  price:
    label: "Price"
initialState: "INIT"
states:
  INIT:
    transitions:
      -
        condition: item
        nextState: item
  item:
    startRecord: true
    transitions:
      -
        condition: date
        nextState: date
  date:
    include: true
    transitions:
      -
        condition: description
        nextState: description
  description:
    include: true
    transitions:
      -
        condition: description
        nextState: description
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity
  order_number:
    include: true
    transitions:
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity
  quantity:
    include: true
    transitions:
      -
        condition: price
        nextState: price
  price:
    include: true
    transitions:
      -
        condition: end
        nextState: end
  end:
    include: false
    transitions:
      -
        condition: any
        nextState: end
conditions:
  item: '73 < ulx < 110 and text =~ /(\\d)*/'
  date: '110 < ulx < 181 and text =~ /([0-9\-]*)/'
  description: '193 < ulx < 366'
  # order_number: '12 <= uly_rel <= 16 and text =~ ^.+/((\d{6})\-)((\d{2}))/'
  order_number: '12 <= uly_rel <= 16 and text =~ ^.+((\d{6})\-)((\d{2}))'
  quantity: '393 < ulx < 459'
  price: '459 < ulx < 523'
  end: 'text =~ /(Footer)/'
  any: "1 = 1"
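To make my intent with the column conditions concrete: most of them are just x-coordinate windows on the upper-left corner of each text box. Here is a minimal Python sketch of that reading. It is not Textricator code: the `Box` class and the `matching_conditions` helper are made up for illustration, the regex parts are simplified to "contains a digit", and `order_number` is left out because it depends on `uly_rel`.

```python
import re
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for one extracted text box (not a Textricator class)."""
    ulx: float  # x coordinate of the upper-left corner
    text: str


# Column windows copied from the `conditions` block above.
# Assumption: the regex parts are reduced to "contains a digit".
COLUMN_CONDITIONS = {
    "item": lambda b: 73 < b.ulx < 110 and re.search(r"\d", b.text) is not None,
    "date": lambda b: 110 < b.ulx < 181 and re.search(r"[0-9\-]", b.text) is not None,
    "description": lambda b: 193 < b.ulx < 366,
    "quantity": lambda b: 393 < b.ulx < 459,
    "price": lambda b: 459 < b.ulx < 523,
}


def matching_conditions(box):
    """Return the names of every column condition the box satisfies."""
    return [name for name, test in COLUMN_CONDITIONS.items() if test(box)]
```

With these windows a box at `ulx = 80` containing a number matches only `item`, while a box at `ulx = 50` or `ulx = 185` matches nothing, since the windows do not cover those positions.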
You may wonder why I insist on using the form processor for this simple example. The reason is that my real-life document will have a much more complex sub-structure of child items under the Description field. This can only (?) be processed efficiently by a state machine, AFAIK.
But maybe this is not the right tool for the job? If so, what other options are there?
UPDATE: (2021-05-18)
The author of Textricator has now bumped the libraries used, updated the documentation, and corrected several working examples and user issues. Thanks to user mweber I now have a perfectly working parser and no longer need to use awk to handle weird columns.
Upvotes: 0
Views: 282
Reputation: 688
As Textricator is kind of a hidden gem for PDF parsing imo, I'm happy to see someone using it. I posted a config that works with the sample document to the GitHub issue:
extractor: "pdf.pdfbox"
header:
  default: 100
footer:
  default: 600
maxRowDistance: 2
rootRecordType: item
recordTypes:
  item:
    label: "item"
    valueTypes:
      - item
      - date
      - description
      - order_number
      - quantity
      - price
valueTypes:
  item:
    label: "Item"
  date:
    label: "Date"
  description:
    label: "Description"
  order_number:
    label: "OrderNo"
  quantity:
    label: "Qty"
  price:
    label: "Price"
initialState: "INIT"
states:
  INIT:
    include: false
    transitions:
      -
        condition: item
        nextState: item
      - condition: any
        nextState: INIT
  item:
    startRecord: true
    transitions:
      -
        condition: date
        nextState: date
  date:
    include: true
    transitions:
      -
        condition: description
        nextState: description
  description:
    include: true
    transitions:
      -
        condition: description
        nextState: description
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity
      -
        condition: item
        nextState: item
  order_number:
    include: true
    transitions:
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity
  quantity:
    include: true
    transitions:
      -
        condition: price
        nextState: price
  price:
    include: true
    transitions:
      -
        condition: end
        nextState: end
      -
        condition: description
        nextState: description
      -
        condition: item
        nextState: item
  end:
    include: false
    transitions:
      -
        condition: any
        nextState: end
conditions:
  item: '73 < ulx < 110 and text =~ /(\\d)*/'
  date: '110 < ulx < 181 and text =~ /([0-9\\-]*)/'
  description: '193 < ulx < 366'
  order_number: '12 <= uly_rel <= 16 and text =~ /^.+(([0-9]{6})\\-)(([0-9]{2}))/'
  quantity: '393 < ulx < 459'
  price: '459 < ulx < 523'
  end: 'text =~ /(Footer)/'
  any: "1 = 1"
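To see what the states in this config do, here is a rough Python simulation of the transition table. It is only an illustration, not how Textricator is implemented, and it rests on several assumptions: each text box has already been reduced to the single condition name it satisfies, transitions are tried in order with the first match winning, states without an explicit `include: false` contribute their text to the record, and a box with no matching transition raises an error. The names `run`, `TRANSITIONS` and `SAMPLE` are made up for this sketch.

```python
# Transition table copied from the `states` block above:
# state -> ordered list of (condition, next_state) pairs.
TRANSITIONS = {
    "INIT": [("item", "item"), ("any", "INIT")],
    "item": [("date", "date")],
    "date": [("description", "description")],
    "description": [
        ("description", "description"),
        ("order_number", "order_number"),
        ("quantity", "quantity"),
        ("item", "item"),
    ],
    "order_number": [("order_number", "order_number"), ("quantity", "quantity")],
    "quantity": [("price", "price")],
    "price": [("end", "end"), ("description", "description"), ("item", "item")],
    "end": [("any", "end")],
}
START_RECORD = {"item"}    # states with `startRecord: true`
EXCLUDED = {"INIT", "end"}  # states with `include: false`


def run(boxes, transitions=TRANSITIONS):
    """Walk (label, text) pairs through the state machine and collect records."""
    state = "INIT"
    records = []
    for label, text in boxes:
        for condition, next_state in transitions[state]:
            if condition == "any" or condition == label:
                state = next_state
                break
        else:
            # Assumption: an unmatched box is an error in this sketch.
            raise ValueError(f"no transition from {state!r} for {label!r}")
        if state in START_RECORD:
            records.append({})
        if state not in EXCLUDED and records:
            records[-1].setdefault(state, []).append(text)
    return records


# Hypothetical input: two table rows surrounded by header and footer text.
SAMPLE = [
    ("other", "Invoice"), ("other", "Item"),
    ("item", "1"), ("date", "2021-05-01"), ("description", "Widget"),
    ("description", "blue"), ("order_number", "123456-01"),
    ("quantity", "2"), ("price", "9.99"),
    ("item", "2"), ("date", "2021-05-02"), ("description", "Gadget"),
    ("quantity", "1"), ("price", "4.50"),
    ("end", "Footer"), ("other", "page 1"),
]

if __name__ == "__main__":
    for record in run(SAMPLE):
        print(record)
```

In this sketch the `any` fallback on INIT is what lets the header text pass without an error, and the `item` transition out of `price` is what starts the second record. Dropping either one from `TRANSITIONS` makes `run(SAMPLE)` raise on the first box it cannot place.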
Upvotes: 1