Jonathan
Jonathan

Reputation: 2213

How to define a fixed-width constraint in parslet

I am looking into parslet to write alot of data import code. Overall, the library looks good, but I'm struggling with one thing. Alot of our input files are fixed width, and the widths differ between formats, even if the actual field doesn't. For example, we might get a file that has a 9-character currency, and another that has 11-characters (or whatever). Does anyone know how to define a fixed width constraint on a parslet atom?

Ideally, I would like to be able to define an atom that understands currency (with optional dollar signs, thousand separators, etc...) And then I would be able to, on the fly, create a new atom based on the old one that is exactly equivalent, except that it parses exactly N characters.

Does such a combinator exist in parslet? If not, would it be possible/difficult to write one myself?

Upvotes: 3

Views: 341

Answers (3)

Nigel Thorne
Nigel Thorne

Reputation: 21548

What about something like this...

class MyParser < Parslet::Parser
    def initialize(widths)
        @widths = widths
        super
    end

    rule(:currency)  {...}
    rule(:fixed_c)   {currency.fixed(@widths[:currency])}


    rule(:fixed_str) {str("bob").fixed(4)}
end 

puts MyParser.new.fixed_str.parse("bob").inspect

This will fail with:

"Expected 'bob' to be 4 long at line 1 char 1"

Here's how you do it:

require 'parslet'

class Parslet::Atoms::FixedLength < Parslet::Atoms::Base  
  attr_reader :len, :parslet
  def initialize(parslet, len, tag=:length)
    super()

    raise ArgumentError, 
      "Asking for zero length of a parslet. (#{parslet.inspect} length #{len})" \
      if len == 0

    @parslet = parslet
    @len = len
    @tag = tag
    @error_msgs = {
      :lenrep  => "Expected #{parslet.inspect} to be #{len} long", 
      :unconsumed => "Extra input after last repetition"
    }
  end

  def try(source, context, consume_all)
    start_pos = source.pos

    success, value = parslet.apply(source, context, false)

    return succ(value) if success && value.str.length == @len

    context.err_at(
      self, 
      source, 
      @error_msgs[:lenrep], 
      start_pos, 
      [value]) 
  end

  precedence REPETITION
  def to_s_inner(prec)
    parslet.to_s(prec) + "{len:#{@len}}"
  end
end

module Parslet::Atoms::DSL
  def fixed(len)
    Parslet::Atoms::FixedLength.new(self, len)
  end
end

Upvotes: 1

Jonathan
Jonathan

Reputation: 2213

Maybe my partial solution will help to clarify what I meant in the question.

Let's say you have a somewhat non-trivial parser:

class MyParser < Parslet::Parser
    rule(:dollars) {
        match('[0-9]').repeat(1).as(:dollars)
    }
    rule(:comma_separated_dollars) {
        match('[0-9]').repeat(1, 3).as(:dollars) >> ( match(',') >> match('[0-9]').repeat(3, 3).as(:dollars) ).repeat(1)
    }
    rule(:cents) {
        match('[0-9]').repeat(2, 2).as(:cents)
    }
    rule(:currency) {
        (str('$') >> (comma_separated_dollars | dollars) >> str('.') >> cents).as(:currency)
        # order is important in (comma_separated_dollars | dollars)
    }
end

Now if we want to parse a fixed-width Currency string; this isn't the easiest thing to do. Of course, you could figure out exactly how to express the repeat expressions in terms of the final width, but it gets really unnecessarily tricky, especially in the comma separated case. Also, in my use case, currency is really just one example. I want to be able to have an easy way to come up with fixed-width definitions for adresses, zip codes, etc....

This seems like something that should be handle-able by a PEG. I managed to write a prototype version, using Lookahead as a template:

class FixedWidth < Parslet::Atoms::Base
    attr_reader :bound_parslet
    attr_reader :width

    def initialize(width, bound_parslet) # :nodoc:
        super()

        @width = width
        @bound_parslet = bound_parslet
        @error_msgs = {
            :premature => "Premature end of input (expected #{width} characters)",
            :failed => "Failed fixed width",
        }
    end

    def try(source, context) # :nodoc:
        pos = source.pos
        teststring = source.read(width).to_s
        if (not teststring) || teststring.size != width
            return error(source, @error_msgs[:premature]) #if not teststring && teststring.size == width
        end
        fakesource = Parslet::Source.new(teststring)
        value = bound_parslet.apply(fakesource, context)
        return value if not value.error?

        source.pos = pos
        return error(source, @error_msgs[:failed])
    end

    def to_s_inner(prec) # :nodoc:
        "FIXED-WIDTH(#{width}, #{bound_parslet.to_s(prec)})"
    end

    def error_tree # :nodoc:
        Parslet::ErrorTree.new(self, bound_parslet.error_tree)
    end
end

# now we can easily define a fixed-width currency rule:
class SHPParser
    rule(:currency15) {
        FixedWidth.new(15, currency >> str(' ').repeat)
    }
end

Of course, this is a pretty hacked solution. Among other things, line numbers and error messages are not good inside of a fixed width constraint. I would love to see this idea implemented in a better fashion.

Upvotes: 1

kaspar
kaspar

Reputation: 114

Methods in parser classes are basically generators for parslet atoms. The simplest form these methods come in are 'rule's, methods that just return the same atoms every time they are called. It is just as easy to create your own generators that are not such simple beasts. Please look at http://kschiess.github.com/parslet/tricks.html for an illustration of this trick (Matching strings case insensitive).

It seems to me that your currency parser is a parser with only a few parameters and that you could probably create a method (def ... end) that returns currency parsers tailored to your liking. Maybe even use initialize and constructor arguments? (ie: MoneyParser.new(4,5))

For more help, please address your questions to the mailing list. Such questions are often easier to answer if you illustrate it with code.

Upvotes: 1

Related Questions