Recursively return all words from .txt file using attoparsec

Question

I am fairly new to Haskell and I'm just starting to learn attoparsec, for parsing large chunks of English text from a .txt file. I know how to count the words in a .txt file without attoparsec, but I'm stuck doing it with attoparsec. When I run my code below on, say,

"Hello World, I am Elliot Anderson. And I'm Mr.Robot. "

I only get back:

Done " World, I am Elliot Anderson. And I'm Mr.Robot. " (Prose {word = "Hello"})

This is my current code:

{-# LANGUAGE OverloadedStrings #-}
import Control.Exception (catch, SomeException)
import System.Environment (getArgs)
import Data.Attoparsec.Text
import qualified Data.Text.IO as Txt
import Data.Char
import Control.Applicative ((<*>), (*>), (<$>), (<|>), pure)

{-
This is how I would usually get the length of the list of words in a .txt file normally.

countWords :: String -> Int
countWords input = sum $ map (length.words) (lines input)

-}

data Prose = Prose {
  word :: String
} deriving Show

prose :: Parser Prose
prose = do
  word <- many' $ letter
  return $ Prose word

main :: IO()
main = do
  input <- Txt.readFile "small.txt"
  print $ parse prose input

Also, how can I get the integer count of words later on? And do you have any suggestions on how to get started with attoparsec?

Vora · Accepted Answer

You have a pretty good start already: you can parse a word. Your prose parser runs only once, though, so it stops after the first word and hands back the rest of the input unconsumed.
What you need next is a Parser [Prose]. You can build one by combining your prose parser with a second parser that consumes the "not prose" parts, using sepBy or sepBy1, both of which are described in the Data.Attoparsec.Text documentation.

From there, the easiest way to get the word count would be to simply get the length of your obtained [Prose].
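
As a rough sketch of that idea, the combination can be written generically. The names countItems and exampleCount are hypothetical helpers introduced here for illustration; they are not part of attoparsec. Only parseOnly, sepBy, many1, letter and space come from Data.Attoparsec.Text.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, letter, many1, parseOnly, sepBy, space)
import Data.Text (Text)

-- Hypothetical helper (not an attoparsec function): parse `item`
-- repeatedly, with `sep` between occurrences, and return how many
-- items were found. Input left over after the last item is ignored.
countItems :: Parser a -> Parser b -> Text -> Either String Int
countItems item sep input = length <$> parseOnly (item `sepBy` sep) input

-- Illustration only: words are runs of letters, separated by whitespace.
exampleCount :: Either String Int
exampleCount = countItems (many1 letter) (many1 space) "Hello World"
```

Here exampleCount evaluates to Right 2. Note that sepBy also accepts zero items, so empty input gives Right 0 rather than an error; sepBy1 would reject it instead.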

EDIT:

Here is a minimal working example. The parser is now run with parseOnly instead of parse: parseOnly never asks for more input and ignores whatever is left over, so a trailing non-word won't trip up the parser.

{-# LANGUAGE OverloadedStrings #-}

module Atto where

--import qualified Data.Text.IO as Txt
import Data.Attoparsec.Text
import Control.Applicative ((*>), (<$>), (<|>), pure)

import qualified Data.Text as T

data Prose = Prose {
  word :: String
} deriving Show

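-- Run p if it matches and discard its result; succeed without consuming
-- anything otherwise. (Local helper: Control.Applicative's optional is
-- not imported above.)
optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())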

-- Modified to disallow empty words, switched to applicative style
prose :: Parser Prose
prose = Prose <$> many1' letter

-- One or more characters that cannot be part of a word:
-- whitespace, or one of , . '
separator :: Parser ()
separator = many1 (space <|> satisfy (inClass ",.'")) >> pure ()

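-- Parse every word in the input; calls error if not even one word is found.
wordParser :: String -> [Prose]
wordParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x -> x
    where
        -- Skip any leading separator, then read words split by separators.
        wp = optional separator *> prose `sepBy1` separator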

main :: IO ()
main = do
  let input = "Hello World, I am Elliot Anderson. \nAnd I'm Mr.Robot.\n"
  let words = wordParser input
  print words
  print $ length words

The provided parser does not give exactly the same result as concatMap words . lines, since it also splits words on the characters . , and ' (so "I'm" becomes "I" and "m"). Modifying this behaviour is left as a simple exercise.
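
One possible take on that exercise, as a sketch rather than the only way to do it: treat every maximal run of non-whitespace characters as a word, so punctuation stays attached the way it does with words. The names token and wordsOnSpace are made up for this example; takeWhile1, skipSpace, sepBy and parseOnly are attoparsec functions, and isSpace comes from Data.Char. This version also works on Text directly instead of going through String.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, parseOnly, sepBy, skipSpace, takeWhile1)
import Data.Char (isSpace)
import Data.Text (Text)

-- A "word" is any maximal run of non-whitespace characters, so
-- "World," and "Mr.Robot." each count as a single word.
token :: Parser Text
token = takeWhile1 (not . isSpace)

-- Skip leading whitespace, then read tokens separated by whitespace.
-- Trailing whitespace is left unconsumed, which parseOnly ignores.
wordsOnSpace :: Text -> Either String [Text]
wordsOnSpace = parseOnly (skipSpace *> (token `sepBy` takeWhile1 isSpace))
```

On the sample input from the example above, fmap length (wordsOnSpace input) gives Right 9, the same count that length (words input) gives on the String version.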

Hope it helps! :)
