Reputation: 885
Several packages are available for using regular expressions in Haskell (e.g. Text.Regex.Base, Text.Regex.Posix etc.). Most packages I've seen so far support only a subset of the regex syntax I know. For example, I am used to splitting a sentence into words with the following regex:
\\w+
Nearly all the Haskell packages I have tried so far don't support this (neither the ones mentioned above nor Text.Regex.TDFA). I know that with Posix, [[:word:]]+ would have the same effect, but I would like to use the variant shown above.
This leads to two questions:
Upvotes: 12
Views: 3575
Reputation: 1889
The words function works well, but it is more of a 'split on whitespace'. To split on a pattern instead, use splitRegex:
import Text.Regex (splitRegex, mkRegex)
splitByWord :: String -> [String]
splitByWord = splitRegex (mkRegex "[^a-zA-Z]+")
> splitByWord "Word splitting with regular expressions in Haskell"
["Word","splitting","with","regular","expressions","in","Haskell"]
Upvotes: 3
Reputation: 3900
I'd use Adam's suggestion or (perhaps more readable)
> :m +Data.Char
> :m +Data.List.Split
> wordsBy (not . isLetter) "Just a simple test."
["Just","a","simple","test"]
No need for regexps here.
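Data.List.Split comes from the split package rather than base. If that package is not installed, roughly the same behaviour can be sketched with base alone. This is only a minimal sketch, not the library's implementation, and the names wordsByBase and letterWords are made up for illustration:

```haskell
import Data.Char (isLetter)

-- Hypothetical helper (not from any library): split a list on every
-- element that satisfies the predicate, dropping empty pieces.
wordsByBase :: (a -> Bool) -> [a] -> [[a]]
wordsByBase isSep xs =
  case dropWhile isSep xs of
    []   -> []
    rest -> let (word, remainder) = break isSep rest
            in word : wordsByBase isSep remainder

-- Split a sentence on anything that is not a letter.
letterWords :: String -> [String]
letterWords = wordsByBase (not . isLetter)
```

With this, letterWords "Just a simple test." gives ["Just","a","simple","test"], the same result as the wordsBy call above.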
Upvotes: 13
Reputation: 16117
If you want to break into words, and filter out things other than letters, you could use filter and isAlpha or isAlphaNum (or any of the other is... functions in Data.Char that suit your need).
import Data.Char
wordsButOnlyLetters = map (filter isAlpha) . words
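One thing to keep in mind with this approach: it strips non-letters from inside each whitespace-separated token instead of splitting on them, so "well-known" turns into the single word "wellknown", and a token made only of punctuation turns into an empty string. A small sketch that drops those empty strings, where nonEmptyLetterWords is a name invented here for illustration:

```haskell
import Data.Char (isAlpha)

-- The definition from the answer, with a type signature added.
wordsButOnlyLetters :: String -> [String]
wordsButOnlyLetters = map (filter isAlpha) . words

-- Hypothetical variant: also discard tokens that contained no letters.
nonEmptyLetterWords :: String -> [String]
nonEmptyLetterWords = filter (not . null) . wordsButOnlyLetters
```

For the input "a well-known fact -- really.", wordsButOnlyLetters returns ["a","wellknown","fact","","really"], while nonEmptyLetterWords returns ["a","wellknown","fact","really"].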
Upvotes: 6
Reputation: 8153
The '\w' is a Perl pattern, and supported by PCRE, which you can access in Haskell with my regex-pcre package or the pcre-light library. If your input is a list of Char then the 'words' function in the standard Prelude may be enough; if your input is an ASCII bytestring then Data.ByteString.Char8 may work. There may be a utf8 library with word splitting, but I cannot quickly find it.
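As a rough illustration of the PCRE route, here is a minimal sketch that assumes the regex-pcre package (and the underlying PCRE C library) is installed. The name pcreWords is made up for this example:

```haskell
import Text.Regex.PCRE ((=~), getAllTextMatches)

-- Collect every maximal run of \w characters. The backslash is doubled
-- only because the pattern sits inside a Haskell string literal.
pcreWords :: String -> [String]
pcreWords s = getAllTextMatches (s =~ "\\w+")
```

In GHCi, pcreWords "Just a simple test." should give ["Just","a","simple","test"]. Note that \w also matches digits and the underscore, so this is not identical to splitting on non-letters.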
Upvotes: 11