centrinok

Reputation: 298

Recursively return all words from .txt file using attoparsec

I am fairly new to Haskell and I'm just starting to learn how to work with attoparsec for parsing huge chunks of English text from a .txt file. I know how to get the number of words in a .txt file without attoparsec, but I'm kind of stuck doing it with attoparsec. When I run my code below on, let's say,

"Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n"

I only get back:

Done " World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n" (Prose {word = "Hello"})

This is my current code:

{-# LANGUAGE OverloadedStrings #-}
import Control.Exception (catch, SomeException)
import System.Environment (getArgs)
import Data.Attoparsec.Text
import qualified Data.Text.IO as Txt
import Data.Char
import Control.Applicative ((<*>), (*>), (<$>), (<|>), pure)

{-
This is how I would usually get the length of the list of words in a .txt file normally.

countWords :: String -> Int
countWords input = sum $ map (length.words) (lines input)

-}

data Prose = Prose {
  word :: String
} deriving Show

prose :: Parser Prose
prose = do
  word <- many' $ letter
  return $ Prose word

main :: IO()
main = do
  input <- Txt.readFile "small.txt"
  print $ parse prose input

Also, how can I get the integer count of words later on? And are there any suggestions on how to get started with attoparsec?

Upvotes: 1

Views: 304

Answers (2)

Benjamin Hodgson

Reputation: 44634

You're on the right track! You've written a parser (prose) which reads a single word: many' letter recognises a sequence of letters.

So now that you've figured out how to parse a single word, your job is to scale this up to parse a sequence of words separated by spaces. That's what sepBy does: p `sepBy` q runs the p parser repeatedly with the q parser interspersed.

So a parser for a sequence of words looks something like this (I've taken the liberty of renaming your prose to word):

import Control.Applicative (many, some)
import Data.Attoparsec.Text
word = many letter
phrase = word `sepBy` some space  -- "some" runs a parser one-or-more times

ghci> parseOnly phrase "wibble wobble wubble"  -- with -XOverloadedStrings
Right ["wibble","wobble","wubble"]

Now, phrase, being composed out of letter and space, will stop at the first character that is neither a letter nor a space, such as ' or a full stop (with parseOnly you just get back the words parsed up to that point). I'll leave it to you to figure out how to fix that. (As a hint, you'll probably need to change many letter to many (letter <|> ...), depending on how exactly you want it to behave on the various punctuation marks.)
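For what it's worth, here is one way the hint could be followed, as a rough sketch and not the only sensible answer. It assumes that apostrophes belong to a word and that commas and full stops act as separators alongside whitespace. The names wordChar, gap and wordsOf are made up for this illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (some, (<|>))
import Data.Attoparsec.Text
  (Parser, char, inClass, letter, parseOnly, satisfy, sepBy, space)
import Data.Text (Text)

-- Assumption: an apostrophe counts as part of a word, so "I'm" stays whole.
wordChar :: Parser Char
wordChar = letter <|> char '\''

-- "some" instead of "many", so that empty words are rejected.
word :: Parser String
word = some wordChar

-- Assumption: whitespace, ',' and '.' are all treated as separators.
gap :: Parser ()
gap = () <$ some (space <|> satisfy (inClass ",."))

phrase :: Parser [String]
phrase = word `sepBy` gap

-- Hypothetical helper: run the parser and ignore any leftover input.
wordsOf :: Text -> Either String [String]
wordsOf = parseOnly phrase
```

On the sample text from the question this should yield ten words, with "I'm" kept intact and "Mr.Robot" split into "Mr" and "Robot". Note that input starting with a separator gives back an empty list, because word has to match first; skipping a leading gap would be a small extension.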

Upvotes: 2

Vora

Reputation: 301

You have a pretty good start already - you can parse a word.
What you need next is a Parser [Prose], which can be expressed by combining your prose parser with another one which consumes the "not prose" parts, using sepBy or sepBy1, which you can look up in the Data.Attoparsec.Text documentation.

From there, the easiest way to get the word count would be to simply get the length of your obtained [Prose].

EDIT:

Here is a minimal working example. The Parser runner has been swapped for parseOnly to allow for residual input to be ignored, meaning that a trailing non-word won't make the parser go cray-cray.

{-# LANGUAGE OverloadedStrings #-}

module Atto where

--import qualified Data.Text.IO as Txt
import Data.Attoparsec.Text
import Control.Applicative ((*>), (<$>), (<|>), pure)

import qualified Data.Text as T

data Prose = Prose {
  word :: String
} deriving Show

optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())

-- Modified to disallow empty words, switched to applicative style
prose :: Parser Prose
prose = Prose <$> many1' letter

separator :: Parser ()
separator = many1 (space <|> satisfy (inClass ",.'")) >> pure ()

wordParser :: String -> [Prose]
wordParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x -> x
    where
        wp = optional separator *> prose `sepBy1` separator

main :: IO ()
main = do
  let input = "Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n"
  let words = wordParser input
  print words
  print $ length words

The provided parser does not give the exact same result as concatMap words . lines, since it also breaks words on the comma, the full stop and the apostrophe. Modifying this behaviour is left as a simple exercise.
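If it helps, a possible take on that exercise is sketched below. It assumes that a "word" is simply any maximal run of non-whitespace characters, which is how words splits a string. The names token, tokens and countTokens are hypothetical, and the result is kept as Text instead of the Prose wrapper.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many)
import Data.Attoparsec.Text (Parser, parseOnly, skipSpace, takeWhile1)
import Data.Char (isSpace)
import Data.Text (Text)

-- Assumption: a token is any non-empty run of non-whitespace characters,
-- so punctuation stays attached to the word it touches.
token :: Parser Text
token = takeWhile1 (not . isSpace)

-- Skip leading whitespace, then read tokens, each followed by optional whitespace.
tokens :: Parser [Text]
tokens = skipSpace *> many (token <* skipSpace)

-- Hypothetical helper: the integer word count asked for in the question.
countTokens :: Text -> Either String Int
countTokens = fmap length . parseOnly tokens
```

For the sample input this should count 9 tokens, the same number that length . concatMap words . lines gives for the equivalent String. Using takeWhile1 also avoids building each word character by character, which may matter on large files.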

Hope it helps! :)

Upvotes: 3
