beyeran

Reputation: 885

Word splitting with regular expressions in Haskell

There are several packages for regular expressions in Haskell (e.g. Text.Regex.Base, Text.Regex.Posix). Most of the ones I've seen support only a subset of the regex syntax I know. For example, I usually split a sentence into words with the following regex:

\\w+

Almost none of the Haskell packages I have tried support this (neither the ones mentioned above nor Text.Regex.TDFA). I know that with POSIX syntax [[:word:]]+ would have the same effect, but I would like to use the variant shown above.

This leads to three questions:

  1. Is there any package that achieves this?
  2. If there is, why is a different syntax the more common one?
  3. What are the advantages and disadvantages of each?

Upvotes: 12

Views: 3575

Answers (4)

Marko Tunjic

Reputation: 1889

The words function works well, but it is more of a 'split by whitespace'. To split on anything that is not a letter, use splitRegex:

import Text.Regex (splitRegex, mkRegex)

splitByWord :: String -> [String]
splitByWord = splitRegex (mkRegex "[^a-zA-Z]+")

> splitByWord "Word splitting with regular expressions in Haskell"
["Word","splitting","with","regular","expressions","in","Haskell"]

Upvotes: 3

Matvey Aksenov

Reputation: 3900

I'd use Adam's suggestion or (perhaps more readable)

> :m +Data.Char
> :m +Data.List.Split
> wordsBy (not . isLetter) "Just a simple test."
["Just","a","simple","test"]

No need for regexps here.

Upvotes: 13

Adam Wagner

Reputation: 16117

If you want to break the text into words and filter out everything other than letters, you could use filter with isAlpha or isAlphaNum (or any of the other is* functions in Data.Char that suit your needs).

import Data.Char

wordsButOnlyLetters = map (filter isAlpha) . words
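
If the goal is closer to what matching \w+ repeatedly would return, the same Data.Char predicates can be used to collect runs of word characters directly. The following is only a sketch using base alone: the names isWordChar and wordRuns are made up for illustration, and it assumes "word character" means a letter, a digit, or an underscore. Note that isAlphaNum is Unicode-aware, so it accepts more characters than an ASCII-only \w would.

```haskell
import Data.Char (isAlphaNum)

-- Assumption: a "word character" is a letter, a digit, or an underscore,
-- which is roughly what \w means in Perl-style regexes.
isWordChar :: Char -> Bool
isWordChar c = isAlphaNum c || c == '_'

-- Collect maximal runs of word characters, skipping everything in between.
wordRuns :: String -> [String]
wordRuns s =
  case dropWhile (not . isWordChar) s of
    ""   -> []
    rest -> let (w, rest') = span isWordChar rest
            in  w : wordRuns rest'
```

> wordRuns "Hello, world: foo_bar 42!"
["Hello","world","foo_bar","42"]

Unlike map (filter isAlpha) . words, this splits on punctuation as well as whitespace and never produces empty strings.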

Upvotes: 6

Chris Kuklewicz

Reputation: 8153

'\w' is a Perl pattern and is supported by PCRE, which you can access in Haskell with my regex-pcre package or the pcre-light library. If your input is a list of Char, the 'words' function in the standard Prelude may be enough; if your input is an ASCII ByteString, Data.ByteString.Char8 may work. There may be a UTF-8 library with word splitting, but I cannot quickly find one.
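
As a rough sketch of the regex-pcre route, the pattern from the question can be applied with =~ and getAllTextMatches. This assumes the regex-pcre package (and the PCRE C library it binds to) is installed; the name wordsPCRE is made up for illustration.

```haskell
import Text.Regex.PCRE ((=~), getAllTextMatches)

-- Return every non-overlapping match of \w+ in the input.
-- The backslash is doubled because the pattern is a Haskell string literal.
wordsPCRE :: String -> [String]
wordsPCRE input = getAllTextMatches (input =~ "\\w+")
```

> wordsPCRE "Word splitting with regular expressions in Haskell"
["Word","splitting","with","regular","expressions","in","Haskell"]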

Upvotes: 11
