Reputation: 2213
I am looking into Parslet to write a lot of data import code. Overall the library looks good, but I'm struggling with one thing. A lot of our input files are fixed width, and the widths differ between formats even when the underlying field doesn't. For example, one file might have a 9-character currency field and another an 11-character one. Does anyone know how to define a fixed-width constraint on a Parslet atom?
Ideally, I would like to define an atom that understands currency (with optional dollar signs, thousands separators, etc.) and then, on the fly, create a new atom based on it that is exactly equivalent except that it parses exactly N characters.
Does such a combinator exist in parslet? If not, would it be possible/difficult to write one myself?
Upvotes: 3
Views: 341
Reputation: 21548
What about something like this...
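class MyParser < Parslet::Parser
  def initialize(widths)
    @widths = widths
    super()  # explicit parens: do not forward widths to the superclass
  end
  rule(:currency) {...}
  rule(:fixed_c) {currency.fixed(@widths[:currency])}
  rule(:fixed_str) {str("bob").fixed(4)}
end
# the constructor needs a widths hash; 9 is just an example width
puts MyParser.new(:currency => 9).fixed_str.parse("bob").inspect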
This will fail with:
"Expected 'bob' to be 4 long at line 1 char 1"
Here's how you do it:
require 'parslet'
# Wraps another atom and succeeds only if its match is exactly len characters long.
class Parslet::Atoms::FixedLength < Parslet::Atoms::Base
  attr_reader :len, :parslet
  def initialize(parslet, len, tag=:length)
    super()
    raise ArgumentError,
      "Asking for zero length of a parslet. (#{parslet.inspect} length #{len})" \
      if len == 0
    @parslet = parslet
    @len = len
    @tag = tag
    @error_msgs = {
      :lenrep => "Expected #{parslet.inspect} to be #{len} long",
      :unconsumed => "Extra input after last repetition"
    }
  end
  def try(source, context, consume_all)
    start_pos = source.pos
    success, value = parslet.apply(source, context, false)
    # NOTE: value.str assumes the wrapped atom returns a Parslet::Slice,
    # i.e. it contains no .as(...) captures.
    return succ(value) if success && value.str.length == @len
    context.err_at(
      self,
      source,
      @error_msgs[:lenrep],
      start_pos,
      [value])
  end
  precedence REPETITION
  def to_s_inner(prec)
    parslet.to_s(prec) + "{len:#{@len}}"
  end
end
# Adds atom.fixed(len) to every parslet atom.
module Parslet::Atoms::DSL
  def fixed(len)
    Parslet::Atoms::FixedLength.new(self, len)
  end
end
Upvotes: 1
Reputation: 2213
Maybe my partial solution will help to clarify what I meant in the question.
Let's say you have a somewhat non-trivial parser:
class MyParser < Parslet::Parser
  rule(:dollars) {
    match('[0-9]').repeat(1).as(:dollars)
  }
  rule(:comma_separated_dollars) {
    match('[0-9]').repeat(1, 3).as(:dollars) >> ( match(',') >> match('[0-9]').repeat(3, 3).as(:dollars) ).repeat(1)
  }
  rule(:cents) {
    match('[0-9]').repeat(2, 2).as(:cents)
  }
  rule(:currency) {
    # order is important in (comma_separated_dollars | dollars)
    (str('$') >> (comma_separated_dollars | dollars) >> str('.') >> cents).as(:currency)
  }
end
Now suppose we want to parse a fixed-width currency string. This isn't the easiest thing to do. Of course, you could work out exactly how to express the repeat expressions in terms of the final width, but that gets unnecessarily tricky, especially in the comma-separated case. Also, in my use case, currency is really just one example. I want an easy way to come up with fixed-width definitions for addresses, zip codes, etc.
This seems like something that should be handle-able by a PEG. I managed to write a prototype version, using Lookahead as a template:
class FixedWidth < Parslet::Atoms::Base
  attr_reader :bound_parslet
  attr_reader :width
  def initialize(width, bound_parslet) # :nodoc:
    super()
    @width = width
    @bound_parslet = bound_parslet
    @error_msgs = {
      :premature => "Premature end of input (expected #{width} characters)",
      :failed => "Failed fixed width",
    }
  end
  def try(source, context) # :nodoc:
    pos = source.pos
    teststring = source.read(width).to_s
    if (not teststring) || teststring.size != width
      return error(source, @error_msgs[:premature]) #if not teststring && teststring.size == width
    end
    fakesource = Parslet::Source.new(teststring)
    value = bound_parslet.apply(fakesource, context)
    return value if not value.error?
    source.pos = pos
    return error(source, @error_msgs[:failed])
  end
  def to_s_inner(prec) # :nodoc:
    "FIXED-WIDTH(#{width}, #{bound_parslet.to_s(prec)})"
  end
  def error_tree # :nodoc:
    Parslet::ErrorTree.new(self, bound_parslet.error_tree)
  end
end
# now we can easily define a fixed-width currency rule
# (inheriting from MyParser so that `rule` and `currency` are available):
class SHPParser < MyParser
  rule(:currency15) {
    FixedWidth.new(15, currency >> str(' ').repeat)
  }
end
Of course, this is a pretty hacked solution. Among other things, line numbers and error messages are not good inside of a fixed width constraint. I would love to see this idea implemented in a better fashion.
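One way to sidestep the error-position problem is to cut each record into fields before any parser sees it, so that every field is parsed on its own. The following is only a sketch in plain Ruby, not part of Parslet: `LAYOUT`, `slice_record`, and the sample record are hypothetical names and data made up for illustration.

```ruby
# Hypothetical layout: field name => width, in file order.
LAYOUT = { :currency => 9, :name => 11 }

# Cut one record into fields of exactly the configured widths.
# Raises if the record length does not match the layout.
def slice_record(line, layout)
  expected = layout.values.sum
  unless line.length == expected
    raise ArgumentError, "expected #{expected} characters, got #{line.length}"
  end

  offset = 0
  layout.each_with_object({}) do |(field, width), fields|
    fields[field] = line[offset, width]
    offset += width
  end
end

fields = slice_record("  $123.45Bob Smith  ", LAYOUT)
```

Each value in `fields` could then be stripped of padding and handed to the ordinary rule, for example `MyParser.new.currency.parse(fields[:currency].strip)`, assuming the `MyParser` class above. The trade-off is that the width check happens outside the grammar.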
Upvotes: 1
Reputation: 114
Methods in parser classes are basically generators for parslet atoms. The simplest form these methods come in is the 'rule': a method that returns the same atom every time it is called. It is just as easy to create your own generators that are not such simple beasts. Please look at http://kschiess.github.com/parslet/tricks.html for an illustration of this trick (matching strings case-insensitively).
It seems to me that your currency parser is a parser with only a few parameters, and that you could probably create a method (def ... end) that returns currency parsers tailored to your liking. Maybe even use initialize and constructor arguments (e.g. MoneyParser.new(4,5))?
For more help, please address your questions to the mailing list. Such questions are often easier to answer if you illustrate it with code.
Upvotes: 1