thor

Reputation: 22530

extract string between two substrings in Haskell

I want to adapt the Python regex (PCRE) technique from the SO question "Find string between two substrings" to Haskell.

But I can't figure out how to make it work in GHC (8.2.1). I installed the library with cabal install regex-pcre and, after some searching, came up with the following test code:

import Text.Regex.PCRE
s = "+++asdf=5;iwantthis123jasd---"
result = (s ++ s) =~ "asdf=5;(.*)123jasd" :: [[String]]

I was hoping to get the first and last instance of the middle string

iwantthis

But I can't get the result right:

[["asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd","iwantthis123jasd---+++asdf=5;iwantthis"]]

I haven't used regex or pcre in Haskell before.

Can someone help with the right usage (to extract the first and last occurrences)? Also, I don't quite understand the :: [[String]] usage here. What does it do, and why is it necessary?

I searched the documentation but found no mention of the usage with type conversion to :: [[String]].

Upvotes: 1

Views: 677

Answers (1)

willeM_ Van Onsem

Reputation: 477265

The result you obtain is the following:

Prelude Text.Regex.PCRE> (s ++ s) =~ "asdf=5;(.*)123jasd" :: [[String]]
[["asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd","iwantthis123jasd---+++asdf=5;iwantthis"]]

This is correct: the first element is the implicit capture group 0 (the text matched by the entire regex), and the second element is capture group 1 (the part matched by (.*)). Because the input is s ++ s, the regex matches like this:

+++asdf=5;iwantthis123jasd---+++asdf=5;iwantthis123jasd---

So the capture still lies between an asdf=5; and a 123jasd, just not the nearest pair: it runs from the first asdf=5; to the last 123jasd.

This is due to the fact that the Kleene star * is greedy: it tries to match as much as possible. You can use the non-greedy quantifier (.*?) instead:

Prelude Text.Regex.PCRE> (s ++ s) =~ "asdf=5;(.*?)123jasd" :: [[String]]
[["asdf=5;iwantthis123jasd","iwantthis"],["asdf=5;iwantthis123jasd","iwantthis"]]

And now we obtain two matches. Each match has "iwantthis" as capture group 1.
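
As for the :: [[String]] annotation in the question: =~ is polymorphic in its result type, so the annotation is not a conversion but the way to tell the library which kind of result to build. With [[String]], each inner list is one match: the whole match followed by its capture groups. The sketch below assumes regex-pcre is installed and runs the original greedy pattern on a single copy of s under a few other result types. The names pat, matched, firstMatch, parts and allMatches are made up for illustration.

```haskell
import Text.Regex.PCRE ((=~))

s :: String
s = "+++asdf=5;iwantthis123jasd---"

-- Illustrative name: the greedy pattern from the question.
pat :: String
pat = "asdf=5;(.*)123jasd"

-- Bool: does the pattern match anywhere in the input?
matched :: Bool
matched = s =~ pat

-- String: the text of the first match (capture group 0).
firstMatch :: String
firstMatch = s =~ pat

-- 4-tuple: (text before, first match, text after, capture groups).
parts :: (String, String, String, [String])
parts = s =~ pat

-- [[String]]: every match, each as group 0 followed by its captures.
allMatches :: [[String]]
allMatches = s =~ pat

main :: IO ()
main = do
  print matched     -- True
  print firstMatch  -- "asdf=5;iwantthis123jasd"
  print parts       -- ("+++","asdf=5;iwantthis123jasd","---",["iwantthis"])
  print allMatches  -- [["asdf=5;iwantthis123jasd","iwantthis"]]
```

Without an annotation (or some other context that fixes the type), GHC cannot tell which of these results is wanted, which is why the expression in the question needs one.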

You can apply map (head . tail) or map (!!1) to that [[String]] result to obtain a list of the captures of the (.*?) part:

Prelude Text.Regex.PCRE> map (!!1) ((s ++ s) =~ "asdf=5;(.*?)123jasd" :: [[String]])
["iwantthis","iwantthis"]
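
To get just the first and last occurrence that the question asks for, one option is to take the head and the last element of that list of captures. This is a minimal sketch, assuming regex-pcre is installed. The helpers captures and firstAndLast are hypothetical names, not part of the library, and the pattern match on the inner list is used instead of (!!1) so that a match without a capture group cannot crash.

```haskell
import Text.Regex.PCRE ((=~))

s :: String
s = "+++asdf=5;iwantthis123jasd---"

-- Hypothetical helper: capture group 1 of every non-overlapping match
-- of the non-greedy pattern.
captures :: String -> [String]
captures input =
  [ c | (_ : c : _) <- (input =~ "asdf=5;(.*?)123jasd" :: [[String]]) ]

-- Hypothetical helper: first and last element of a list,
-- or Nothing when the list is empty (no match at all).
firstAndLast :: [a] -> Maybe (a, a)
firstAndLast [] = Nothing
firstAndLast xs@(x : _) = Just (x, last xs)

main :: IO ()
main = do
  print (captures (s ++ s))                 -- ["iwantthis","iwantthis"]
  print (firstAndLast (captures (s ++ s)))  -- Just ("iwantthis","iwantthis")
  print (firstAndLast (captures "nothing")) -- Nothing
```

Returning a Maybe keeps the no-match case explicit instead of failing on an empty list.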

Upvotes: 4
