Reputation: 601
Learning to use the Parsec library, part of homework.
EDIT: Suggestions to use other libraries are welcome, the point is the parsing.
What I want, is to extract all words with a capital letter, and four compass directions from any sentence. Example: "Belgium totally lies south of Holland." should find and return "Belgium south Holland".
What I can't figure is how to ignore (eat) any input that is -not- a compass direction. I was looking to find something along the lines of
'many (not compassDirection >> space)'
but g(h)oogle isn't helping me.
Following code is obviously stuck on the 'many' function.
readExpr :: String -> String
readExpr input = case parse (parseLine) "" input of
Left err -> "No match: " ++ show err
Right val -> "Found: " ++ showVal val
parseLine :: Parser GraphValue
parseLine = do
x <- parseCountry
space
many ( some (noneOf " ") >> space )
y <- parseCompass
space
many ( some (noneOf " ") >> space )
z <- parseCountry
return $ Direction [x,y,z]
compassDirection :: Parser String
compassDirection = string "north" <|>
string "south" <|>
string "east" <|>
string "west"
parseCountry :: Parser GraphValue
parseCountry = do
c <- upper
x <- many (lower)
return $ Country (c:x)
parseCompass :: Parser GraphValue
parseCompass = do
x <- compassDirection
return $ Compass x
Upvotes: 5
Views: 1173
Reputation: 32455
I'm going to tell you how I would start and then advice about how I'd carry on.
I'd base this on an abstract data structure - as you add extra words you can classify them more closely:
data Word = Country String | Direction NSEW | Unclassified String
data NESW = North | East | South | West
so my answer to how you skip words you don't know about is that you don't need to - leave them as unclassified.
The applicative style is nicer than the monadic style.
I think compassDirection
should allow capitals:
compassDirection :: Parser NESW
compassDirection = north <|> south <|> east <|> west where
north = North <$ (string "north" <|> string "North")
east = ...
You can define country
using Country <$> ((:) <$> upper <*> many lower)
Then you can have a catch-all Unclassified <$> many letter
.
Your word parser can currently be
word = compassDirection <|> country <|> unclassified
but notice that compassDirection
has to come before country
because otherwise country
would match North
.
You can do
words = word `sepBy1` space
at the moment that's OK, but you must must must not use word
or words
when you're analysing sentences more properly, because you lose control over what the word is. At that point you'd need noun
, adjective
, nounPhrase
, verb
, adjective
, adjectivalPhrase
etc to pasrse a sentence structure. Sentences that you can't parse mean you need to add new constructs to your grammar.
It's worth making word parsers swallow the whitespace after them (or before), or refactoring using a preprocessor that splits words from spaces and punctuation. Consider having a fullStop
parser if you're British or a period
parser if you're American. Use it when you create a sentence parser.
Using applicative and higher order functions will make it much clearer to write your grammar, because you'll not have it cluttered up with monadic notation, and it will look like sentences.
Example: you could do nvn = NVN <$> noun <*> verb <*> noun
if you want to use a primarily Abstract Data Structure (AST) approach with one constructor per grammatical object.
If you just prefer to have a bunch of words lying around all of the same type, you can do nvn = sequence [noun,verb,noun]
.
Most computer languages are parsed using an AST approach, but I've no direct experience of natural language parsing beyond what I picked up second-hand from my wife's linguistics degree.
If you sit down and write how you can combine categories of words, phrases, clauses and sentences together, you'll be able to write the parser fairly quickly.
Upvotes: 4
Reputation: 12749
Can't you just split the string into words
, filter
the ones that start with an uppercase letter or are a compass direction, and then unwords
them back together? No need to pull out the Parsec
gun.
Upvotes: 0
Reputation: 49085
I won't go into specifics since this is homework and the OP said the "important thing is the parsing".
The way I'd solve this problem:
tokenize the input. Break it into words; this will free the real parsing step from having to worry about token definitions (i.e. "is %#@[ part of a word?") or whitespace. This could be as simple as words
or you could use Parsec for the tokenization. Then you'll have [Token]
(or [String]
if you prefer).
a parser for compass directions. You already have this (good job), but it'll have to modified a bit if the input is [String]
instead of String
.
a parser for words that start with a capital letter.
a parser for everything else, that succeeds whenever it sees a token that isn't a compass direction or a word starting with a caps.
a parser that works on any token, but distinguishes between good stuff and bad stuff, perhaps using an algebraic data type.
a parser that works on lots of tokens
Hopefully that's clear without being too clear; you'll still have to worry about when to discard the junk, for example. The basic idea is to break the problem down into lots of little sub-problems, solve the sub-problems, then glue those solutions together.
Upvotes: 4