How to set the FSM configuration for Textricator PDF OCR reader?

Question

I'm trying to use the PDF document parser called Textricator. It can parse a PDF with one of three methods, built on a common PDF library (itext5, itext7, or pdfbox). The available methods are text, table, and form: text for plain text extraction, table for reading out structured table data, and form for parsing less structured forms using a finite state machine (FSM).

However, I am not able to use the form parser. Perhaps I simply don't understand how to organize the many configuration states. The documentation lacks a simple form example. Someone recently posted an attempt to read a very basic table using the form method, but could not get it to work. I also gave it a shot, without any success.

Q: Can someone help me configure the state machine in the YML file?
(The configuration below is meant to parse the demo file from one of the Textricator repository's issues, shown in the copied screenshot below.)

The YML configuration file:

extractor: "pdf.pdfbox"

header:
  default: 100
footer:
  default: 600

maxRowDistance: 2

rootRecordType: item
recordTypes:
  item:
    label: "item"
    valueTypes:
      - item
      - date
      - description
      - order_number
      - quantity
      - price

valueTypes:
  item:
    label: "Item"
  date:
    label: "Date"
  description:
    label: "Description"
  order_number:
    label: "OrderNo"
  quantity:
    label: "Qty"
  price:
    label: "Price"
 
initialState: "INIT"

states:
  INIT:
    transitions:
      -
        condition: item
        nextState: item

  item:
    startRecord: true
    transitions:
      -
        condition: date
        nextState: date  

  date:
    include: true
    transitions:
      -
        condition: description
        nextState: description  

  description:
    include: true
    transitions:
      -
        condition: description
        nextState: description     
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity

  order_number:
    include: true
    transitions:
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity

  quantity:
    include: true
    transitions:
      -
        condition: price
        nextState: price

  price:
    include: true
    transitions:
      -
        condition: end
        nextState: end

  end:
    include: false
    transitions:
      -
        condition: any
        nextState: end

conditions:

  item:         '73 < ulx < 110 and text =~ /(\d)*/'
  date:         '110 < ulx < 181 and text =~ /([0-9\-]*)/'
  description:  '193 < ulx < 366'
#  order_number: '12 <= uly_rel <= 16 and text =~ ^.+/((\d{6})\-)((\d{2}))/'
  order_number: '12 <= uly_rel <= 16 and text =~ ^.+((\d{6})\-)((\d{2}))'
  quantity:     '393 < ulx < 459'
  price:        '459 < ulx < 523'

  end:          'text =~ /(Footer)/'
  any: "1 = 1"
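
For readers less familiar with the condition syntax, here is a rough Python model of what conditions such as `item` and `order_number` appear to test: a range check on `ulx` (presumably the upper-left x coordinate of a text fragment) combined with a regular expression on `text`. The function names and sample strings are hypothetical, and whether Textricator's `=~` searches within the text or must match all of it is an assumption here (the sketch uses a search).

```python
import re

# Hypothetical Python model of two conditions from the config above.
# Assumption: `=~` behaves like a regex search (re.search), not a full match.


def item_condition(ulx: float, text: str) -> bool:
    # Models: 73 < ulx < 110 and text =~ /(\d)*/
    return 73 < ulx < 110 and re.search(r"(\d)*", text) is not None


def order_number_pattern_matches(text: str) -> bool:
    # Models only the regex part of order_number, without the / delimiters.
    return re.search(r"^.+(([0-9]{6})\-)(([0-9]{2}))", text) is not None


# Made-up sample values, not taken from the demo PDF.
print(item_condition(80, "7"))                       # True
print(item_condition(80, "abc"))                     # True: (\d)* also matches zero digits
print(item_condition(200, "7"))                      # False: outside the ulx range
print(order_number_pattern_matches("PO 123456-01"))  # True
print(order_number_pattern_matches("123456-01"))     # False: ^.+ needs a leading character
```

Under this model the `(\d)*` pattern accepts any text, so the `ulx` range is what actually separates the item column, and the `order_number` pattern only matches when something precedes the six-digit group.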

You may wonder why I insist on using the form processor for this simple example. It is because my real-life document will have a much more complex sub-structure of child items under the Description field. This can only (?) be processed efficiently by a state machine, AFAIK.

But maybe this is not the right tool for the job? If so, what other options are there?

UPDATE: (2021-05-18)

The author of Textricator has now updated the libraries used and the documentation, and has corrected several working examples and user issues. Thanks to user mweber, I now have a perfectly working parser and no longer need to use awk to handle weird columns.

mweber · Accepted Answer

As Textricator is kind of a hidden gem for PDF parsing imo, I'm happy to see someone using it. I posted a config that works with the sample document to the GitHub issue:

extractor: "pdf.pdfbox"

header:
  default: 100
footer:
  default: 600

maxRowDistance: 2

rootRecordType: item
recordTypes:
  item:
    label: "item"
    valueTypes:
      - item
      - date
      - description
      - order_number
      - quantity
      - price

valueTypes:
  item:
    label: "Item"
  date:
    label: "Date"
  description:
    label: "Description"
  order_number:
    label: "OrderNo"
  quantity:
    label: "Qty"
  price:
    label: "Price"

initialState: "INIT"

states:
  INIT:
    include: false
    transitions:
      -
        condition: item
        nextState: item
      - condition: any
        nextState: INIT

  item:
    startRecord: true
    transitions:
      -
        condition: date
        nextState: date  

  date:
    include: true
    transitions:
      -
        condition: description
        nextState: description  

  description:
    include: true
    transitions:
      -
        condition: description
        nextState: description     
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity
      -
        condition: item
        nextState: item

  order_number:
    include: true
    transitions:
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity

  quantity:
    include: true
    transitions:
      - 
        condition: price
        nextState: price

  price:
    include: true
    transitions:
      -
        condition: end
        nextState: end
      - 
        condition: description
        nextState: description
      -
        condition: item
        nextState: item

  end:
    include: false
    transitions:
      -
        condition: any
        nextState: end

conditions:

  item:         '73 < ulx < 110 and text =~ /(\d)*/'
  date:         '110 < ulx < 181 and text =~ /([0-9\-]*)/'
  description:  '193 < ulx < 366'
  order_number: '12 <= uly_rel <= 16 and text =~ /^.+(([0-9]{6})\-)(([0-9]{2}))/'
  quantity:     '393 < ulx < 459'
  price:        '459 < ulx < 523'

  end:          'text =~ /(Footer)/'
  any: "1 = 1"
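
One way to see why this config can produce more than one record is to treat the `states` section as a directed graph and ask whether the `item` state, which has `startRecord: true`, can be reached again after a record has started. The sketch below transcribes the transitions of the question's config and of this one by hand into Python dictionaries (the variable and function names are hypothetical) and runs a breadth-first search. It only inspects the transition tables and does not model how Textricator itself evaluates them.

```python
from collections import deque

# Transition targets copied by hand from the two YAML configs (conditions omitted).
QUESTION_STATES = {
    "INIT": ["item"],
    "item": ["date"],
    "date": ["description"],
    "description": ["description", "order_number", "quantity"],
    "order_number": ["order_number", "quantity"],
    "quantity": ["price"],
    "price": ["end"],
    "end": ["end"],
}

ANSWER_STATES = {
    "INIT": ["item", "INIT"],
    "item": ["date"],
    "date": ["description"],
    "description": ["description", "order_number", "quantity", "item"],
    "order_number": ["order_number", "quantity"],
    "quantity": ["price"],
    "price": ["end", "description", "item"],
    "end": ["end"],
}


def can_reenter(states, state):
    """Return True if `state` is reachable again after leaving it."""
    queue = deque(states[state])
    seen = set()
    while queue:
        current = queue.popleft()
        if current == state:
            return True
        if current in seen:
            continue
        seen.add(current)
        queue.extend(states[current])
    return False


print(can_reenter(QUESTION_STATES, "item"))  # False
print(can_reenter(ANSWER_STATES, "item"))    # True
```

In the question's config nothing leads back to `item`, so as far as the graph is concerned a record can start at most once. Here `description` and `price` both transition back to `item`, and `INIT` loops on `any` until the first item appears.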
