capgelka
capgelka

Reputation: 99

Haskell fast text processing

I am working on function to convert javascript encoded unicode string from server api to utf8.

I have 2 approaches. The first is not general enough, and second is too slow. How can I do it both fast and for all unicode symbols?

First is just replacing some substring using Map

ununicode :: BL.ByteString -> BL.ByteString               
ununicode s = LE.encodeUtf8 $ replace $ LE.decodeUtf8 s where 
  replace :: L.Text -> L.Text
  replace "" = ""
  replace string = case Map.lookup (L.take 6 string) table of
          (Just x)  -> L.append x (replace $ L.drop 6 string)
          Nothing   -> L.cons (L.head string) (replace $ L.tail string)

  table = Map.fromList $ zip letters rus

  rus =  ["Ё", "ё", "А", "Б", "В", "Г", "Д", "Е", "Ж", "З", "И", "Й", "К", "Л", "М",
         "Н", "О", "П", "Р", "С", "Т", "У", "Ф", "Х", "Ц", "Ч", "Ш", "Щ", "Ъ", "Ы",
         "Ь", "Э", "Ю", "Я", "а", "б", "в", "г", "д", "е", "ж", "з", "и", "й", "к",
         "л", "м", "н", "о", "п", "р", "с", "т", "у", "ф", "х", "ц", "ч", "ш", "щ",
         "ъ", "ы", "ь", "э", "ю", "я", "—"]  :: [L.Text]

  letters = ["\\u0401", "\\u0451", "\\u0410", "\\u0411", "\\u0412", "\\u0413", 
             "\\u0414", "\\u0415", "\\u0416", "\\u0417", "\\u0418", "\\u0419",
             "\\u041a", "\\u041b", "\\u041c", "\\u041d", "\\u041e", "\\u041f",
             "\\u0420", "\\u0421", "\\u0422", "\\u0423", "\\u0424", "\\u0425",
             "\\u0426", "\\u0427", "\\u0428", "\\u0429", "\\u042a", "\\u042b",
             "\\u042c", "\\u042d", "\\u042e", "\\u042f", "\\u0430", "\\u0431",
             "\\u0432", "\\u0433", "\\u0434", "\\u0435", "\\u0436", "\\u0437",
             "\\u0438", "\\u0439", "\\u043a", "\\u043b", "\\u043c", "\\u043d",
             "\\u043e", "\\u043f", "\\u0440", "\\u0441", "\\u0442", "\\u0443",
             "\\u0444", "\\u0445", "\\u0446", "\\u0447", "\\u0448", "\\u0449",
             "\\u044a", "\\u044b", "\\u044c", "\\u044d", "\\u044e", "\\u044f",
             "\\u2014"] :: [L.Text]

And in second I use finite automata in foldl function. (I wanted to use regexp but found no lib that supports function instead of string in sub like in python https://docs.python.org/3/library/re.html#re.sub)

ununicode :: BL.ByteString -> BL.ByteString               
ununicode s = LE.encodeUtf8 $  parts $ LE.decodeUtf8 s where 

  parts :: L.Text -> L.Text
  parts = fst . parts' where
      lst (_, _, x) = x
      snd (_, x, _) = x
      fst (x, _, _) = x
      parts' :: L.Text -> (L.Text, Integer, L.Text)
      parts' = L.foldl f ("", 0, "") where
          f :: (L.Text, Integer, L.Text) -> Char -> (L.Text, Integer, L.Text)
          f p n | snd p == 0 = case n of
                    ('\\') -> (fst p, 2, lst p)
                    (x)    -> (L.singleton n, 1, lst p)
                | snd p == 1 = case n of
                    ('\\') -> (fst p, 2, lst p)
                    (x)    -> ((fst p) `L.snoc` n, 1, lst p)
                | snd p == 2 = case n of
                    ('u')  ->  (fst p, 3, lst p)
                    x      ->  ((L.snoc (L.snoc (fst p) 
                                                '\\')
                                         n),
                                1, 
                                lst p)
                | snd p == 3 = proc p n
          proc :: (L.Text, Integer, L.Text) -> Char -> (L.Text, Integer, L.Text)
          proc (text, 3, buff) n | isHexDigit n           = (text, 3, buff `L.snoc` n)
                                 | (len > 3) && (len < 6) = (L.append text
                                                                      (replacedChoice buff n), 
                                                             if n == '\\' then 2 else 1,
                                                             L.empty)
                                 | otherwise              =  (L.append text 
                                                                       (L.append "\\u"
                                                                                  (choice buff n)),
                                                             if n == '\\' then 2 else 1,
                                                             L.empty) where
                                  len = L.length buff
                                  choice b n = if n == '\\' then b else L.snoc b n
                                  replacedChoice b n = if n == '\\' 
                                                       then repl b 
                                                       else L.snoc (repl b) n

  repl :: L.Text -> L.Text
  repl "" = ""
  repl s  = (\v -> case v of 
      (Right x) -> L.singleton $ (\t -> toEnum t :: Char) $ fst x
      (Left x) -> error $ "impossible" ++ (show x)) (hexadecimal s)

I am just running my tests (all is very fast, except this) to see time. Something like

it "converts url encoded string" $ do
        (ununicode $ BL.pack $ concat $ replicate 1000 ("error:\\u041d\\u0435\\u0432\\u0435\\u0440\\u043d\\u044b\\u0439\\u100cc" :: String))
         `shouldBe`
         ("error:\208\157\208\181\208\178\208\181\209\128\208\189\209\139\208\185")

It takes 0.5 sec for first implementation and from 5 to 15 for second. That's too much.

How can I make my second algorithm as fast as first (or even faster if it possible)?

Upvotes: 0

Views: 360

Answers (1)

ErikR
ErikR

Reputation: 52029

Fast string processing in Haskell is somewhat of a black art.

Are you trying to do the same thing as the jstring function in the aeson package?

https://hackage.haskell.org/package/aeson-0.11.2.0/docs/Data-Aeson-Parser.html#v:jstring

Even if it's not exactly the same thing you might get some ideas by looking its implementation. It uses attoparsec underneath which come in both strict and lazy versions.

Upvotes: 2

Related Questions