Reputation: 1179
Here's an example of the file I'm trying to parse:
XX00135 ABCDEFGHIJ RISK SOLUTIONS PAGE NO : 7
BEG PER: 03/17/2014 CURRENT COMPANY 03/18/2014
END PER: 03/18/2014 QA PROCESS - REJECT REPORT 20:28:36
BATCH: 123456789 CONTRIB: 987654321 - ABCDE FGHI-SAN DIEGO
QUOTE BACK: 1A23B45C79
CODE ACCOUNT NO TYP COMPANY NAME BEG DATE END DATE ERR
------ -------------------- --- -------------------- -------- -------- ---
12345 1234567890001 AB ABCDE FGHI PRODUCTS 20140314 20140914 059
XX00135 ABCDEFGHIJ RISK SOLUTIONS PAGE NO : 8
BEG PER: 03/17/2014 CURRENT COMPANY 03/18/2014
END PER: 03/18/2014 QA PROCESS - REJECT REPORT 20:28:36
BATCH: 234567890 CONTRIB: 987654321 - ABCDE FGHI-SAN DIEGO
QUOTE BACK: 5F7A657G87
CODE ACCOUNT NO TYP COMPANY NAME BEG DATE END DATE ERR
------ -------------------- --- -------------------- -------- -------- ---
12346 2345678901 AB ABCDE FGHI PRODUCTS 20140129 20140729 059
12346 3456789012 AB ABCDE FGHI PRODUCTS 20140317 20140917 059
XX00135 ABCDEFGHIJ RISK SOLUTIONS PAGE NO : 9
BEG PER: 03/17/2014 CURRENT COMPANY 03/18/2014
END PER: 03/18/2014 QA PROCESS - REJECT REPORT 20:28:36
BATCH: 345678901 CONTRIB: 987654321 - ABCDE FGHI-SAN DIEGO
QUOTE BACK: 6K75L8791L
CODE ACCOUNT NO TYP COMPANY NAME BEG DATE END DATE ERR
------ -------------------- --- -------------------- -------- -------- ---
12346 4567890123 AB ABCDE FGHI PRODUCTS 20140317 20140917 059
12346 4567890123 AB ABCDE FGHI PRODUCTS 20140317 20140917 059
NUMBER OF SETS REJECTED ARE : 13 TOTAL SETS IN BATCH: 16,940
*** END OF REPORT ***
And here is a collection of snippets from my module:
module XX00135 (parseFile) where

import Control.Applicative ((<$>), (<*>), (<*))
import Text.ParserCombinators.Parsec hiding (Line)

data Line = Line { code :: String
                 , account :: String
                 , aType :: String
                 , company :: String
                 , begDate :: String
                 , endDate :: String
                 , errCode :: String }

data Page = Page { periodBeginning :: String
                 , periodEnd :: String
                 , reportDate :: String
                 , batch :: String
                 , contrib :: String
                 , quoteBack :: String
                 , lineList :: [Line] }

data Report = Report { pages :: [Page] }

parseReportDate :: Parser String
parseReportDate =
  manyTill anyChar (string "CURRENT COMPANY") >> spaces >> count 10 anyChar

headers :: Parser String
headers =
  choice [ try (string "\n")
         , try (string "CODE ACCOUNT NO TYP COMPANY NAME BEG DATE END DATE ERR")
         , try (string "------ -------------------- --- -------------------- -------- -------- ---") ]

line :: Parser Line
line =
  Line <$> count 6 anyChar <* space
       <*> count 20 anyChar <* space
       <*> count 3 anyChar <* space
       <*> count 20 anyChar <* space
       <*> count 8 anyChar <* space
       <*> count 8 anyChar <* space
       <*> count 3 anyChar <* newline

page :: Parser Page
page =
  Page <$> (manyTill anyChar (string "BEG PER:") >> space >> count 10 anyChar)
       <*> parseReportDate
       <*> (manyTill anyChar (string "END PER:") >> space >> count 10 anyChar)
       <*> (manyTill anyChar (string "BATCH:") >> space >> count 9 anyChar)
       <*> (space >> string "CONTRIB:" >> space >> count 9 anyChar)
       <*> (manyTill anyChar (string "QUOTE BACK:") >> space >> count 10 anyChar
            <* skipMany1 headers)
       <*> (manyTill line (twoNewLines <|> footer))

report :: Parser Report
report = Report <$> manyTill page (try footer)

twoNewLines :: Parser ()
twoNewLines = (count 2 newline) >> return ()

footer :: Parser ()
footer = (space >> string "NUMBER OF SETS REJECTED"
                >> manyTill anyChar (string "*** END OF REPORT ***")
                >> optional eof) >> return ()

parseFile :: [(String, String)] -> String -> String
parseFile errors text =
  let rs = case parse (manyTill report eof) "" text of
  ...
There are 115 lines in the full file. When I cat the file and pipe it to my Haskell program, I get:
(line 116, column 1);
unexpected end of input
expecting "BEG PER:"
I had it working by just ignoring the footer and anything that followed. But my full use case is to cat multiple files and pipe the result to my Haskell program, so I cannot just throw away the footer and everything after it. My problems began once I tried to skip over the footer and keep parsing instead of discarding the rest of the input. It's probably something simple, and I'm just confused and overlooking something obvious.
Let me know if you need more code. I do a few transformations after parsing, and I didn't want to clutter the code with unnecessary detail.
Thanks!
Upvotes: 0
Views: 811
Reputation: 1179
I've resolved the problem. The code is a little different, and I'm not sure what exactly solved the problem. I spent a lot of time staring at the code and making little changes here and there. I think, though, that it had to do with the trailing newline at the end of the file, which cat passes through unchanged. So I changed footer:
footer = space >> string "NUMBER OF SETS REJECTED"
  >> anyChar `manyTill` (string "*** END OF REPORT ***") >> newline >> string ""
Now footer consumes an extra newline at the end of the file, and returns a string. I use footer in eop (end of page):
eop =
  choice [ count 2 newline
         , footer ]
And I use eop in the last line of page:
<*> line `manyTill` eop
report is now:
report = count 2 newline >> Report <$> many page
I also changed page. I think it was consuming anyChar in unexpected ways. So now I throw away the first line of each page:
page = firstLine >>
  Page <$> (string "BEG PER:" >> space >> count 10 anyChar)
  ...
firstLine =
  string "XX00135 ABCDEFGHIJ RISK SOLUTIONS PAGE NO :"
  >> spaces >> many digit >> newline
I think that covers all the important changes I made that eventually made the parse successful. It now parses a single file from the cat command, as well as multiple files concatenated by the cat command. Yay! I love Haskell.
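For anyone who wants to try the idea in isolation, here is a minimal sketch of the eop approach on a made-up, much simpler format. This is not my real parser: the "PAGE <n>" header, the digit-only rows, the single blank line between pages, and the name allPages are all assumptions for illustration. The part that matters is that the footer eats its own trailing newline, so the next concatenated file starts cleanly on a page header.

```haskell
import Text.ParserCombinators.Parsec

-- Hypothetical, simplified page: a "PAGE <n>" header line, then rows of digits.
data Page = Page { pageNo :: String, rows :: [String] } deriving (Eq, Show)

-- One data row: digits followed by the row's own newline.
row :: Parser String
row = many1 digit <* newline

-- The footer consumes its trailing newline and returns an empty string,
-- so it has the same type as the other end-of-page alternative.
footer :: Parser String
footer = string "*** END OF REPORT ***" >> newline >> return ""

-- End of page: either a blank line (simplified separator) or the footer.
eop :: Parser String
eop = choice [ string "\n", footer ]

-- Rows are read until eop succeeds; eop is tried before each row.
page :: Parser Page
page = Page <$> (string "PAGE " >> many1 digit <* newline)
            <*> row `manyTill` eop

-- Pages from any number of concatenated reports, then end of input.
allPages :: Parser [Page]
allPages = many page <* eof
```

Running parse allPages "" on two such reports concatenated back to back should give one flat list of pages, because each footer is consumed as the end of its last page.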
Upvotes: 1
Reputation: 8153
It looks like page consumes footer:
<*> (manyTill line (twoNewLines <|> footer))
And thus report does not get to consume footer:
report = Report <$> manyTill page (try footer)
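Once page has consumed the footer, report tries footer, fails, and runs page again, which scans to the end of the input looking for "BEG PER:". Perhaps you need sepBy to recognize twoNewLines between pages, dropping that last manyTill from page so that report is the one that consumes footer.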
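A rough sketch of that structure, on a deliberately simplified format rather than your real report: the "PAGE <n>" header, the digit-only rows, the single-newline separator, and the names pageSep and reports are assumptions made up for illustration. The point is only that page stops at the end of its rows, the separator lives in sepBy1, and report alone consumes the footer.

```haskell
import Text.ParserCombinators.Parsec

-- Hypothetical, simplified page: a "PAGE <n>" header line, then rows of digits.
data Page = Page { pageNo :: String, rows :: [String] } deriving (Eq, Show)

-- One data row: digits followed by the row's own newline.
row :: Parser String
row = many1 digit <* newline

-- A page knows nothing about separators or footers.
page :: Parser Page
page = Page <$> (string "PAGE " *> many1 digit <* newline)
            <*> many1 row

-- Simplified separator: one blank line between pages.
pageSep :: Parser ()
pageSep = () <$ newline

-- The footer, including its trailing newline.
footer :: Parser ()
footer = () <$ (string "*** END OF REPORT ***" *> newline)

-- A report is pages separated by pageSep, closed by exactly one footer.
report :: Parser [Page]
report = (page `sepBy1` pageSep) <* footer

-- Several concatenated reports, then end of input.
reports :: Parser [[Page]]
reports = many1 report <* eof
```

With this layout the pages stay grouped per report, which may or may not be what you want when several files are concatenated.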
Upvotes: 0