Reputation: 99
I am working on function to convert javascript encoded unicode string from server api to utf8.
I have 2 approaches. The first is not general enough, and second is too slow. How can I do it both fast and for all unicode symbols?
First is just replacing some substring using Map
ununicode :: BL.ByteString -> BL.ByteString
ununicode s = LE.encodeUtf8 $ replace $ LE.decodeUtf8 s where
replace :: L.Text -> L.Text
replace "" = ""
replace string = case Map.lookup (L.take 6 string) table of
(Just x) -> L.append x (replace $ L.drop 6 string)
Nothing -> L.cons (L.head string) (replace $ L.tail string)
table = Map.fromList $ zip letters rus
rus = ["Ё", "ё", "А", "Б", "В", "Г", "Д", "Е", "Ж", "З", "И", "Й", "К", "Л", "М",
"Н", "О", "П", "Р", "С", "Т", "У", "Ф", "Х", "Ц", "Ч", "Ш", "Щ", "Ъ", "Ы",
"Ь", "Э", "Ю", "Я", "а", "б", "в", "г", "д", "е", "ж", "з", "и", "й", "к",
"л", "м", "н", "о", "п", "р", "с", "т", "у", "ф", "х", "ц", "ч", "ш", "щ",
"ъ", "ы", "ь", "э", "ю", "я", "—"] :: [L.Text]
letters = ["\\u0401", "\\u0451", "\\u0410", "\\u0411", "\\u0412", "\\u0413",
"\\u0414", "\\u0415", "\\u0416", "\\u0417", "\\u0418", "\\u0419",
"\\u041a", "\\u041b", "\\u041c", "\\u041d", "\\u041e", "\\u041f",
"\\u0420", "\\u0421", "\\u0422", "\\u0423", "\\u0424", "\\u0425",
"\\u0426", "\\u0427", "\\u0428", "\\u0429", "\\u042a", "\\u042b",
"\\u042c", "\\u042d", "\\u042e", "\\u042f", "\\u0430", "\\u0431",
"\\u0432", "\\u0433", "\\u0434", "\\u0435", "\\u0436", "\\u0437",
"\\u0438", "\\u0439", "\\u043a", "\\u043b", "\\u043c", "\\u043d",
"\\u043e", "\\u043f", "\\u0440", "\\u0441", "\\u0442", "\\u0443",
"\\u0444", "\\u0445", "\\u0446", "\\u0447", "\\u0448", "\\u0449",
"\\u044a", "\\u044b", "\\u044c", "\\u044d", "\\u044e", "\\u044f",
"\\u2014"] :: [L.Text]
And in second I use finite automata in foldl function. (I wanted to use regexp but found no lib that supports function instead of string in sub like in python https://docs.python.org/3/library/re.html#re.sub)
ununicode :: BL.ByteString -> BL.ByteString
ununicode s = LE.encodeUtf8 $ parts $ LE.decodeUtf8 s where
parts :: L.Text -> L.Text
parts = fst . parts' where
lst (_, _, x) = x
snd (_, x, _) = x
fst (x, _, _) = x
parts' :: L.Text -> (L.Text, Integer, L.Text)
parts' = L.foldl f ("", 0, "") where
f :: (L.Text, Integer, L.Text) -> Char -> (L.Text, Integer, L.Text)
f p n | snd p == 0 = case n of
('\\') -> (fst p, 2, lst p)
(x) -> (L.singleton n, 1, lst p)
| snd p == 1 = case n of
('\\') -> (fst p, 2, lst p)
(x) -> ((fst p) `L.snoc` n, 1, lst p)
| snd p == 2 = case n of
('u') -> (fst p, 3, lst p)
x -> ((L.snoc (L.snoc (fst p)
'\\')
n),
1,
lst p)
| snd p == 3 = proc p n
proc :: (L.Text, Integer, L.Text) -> Char -> (L.Text, Integer, L.Text)
proc (text, 3, buff) n | isHexDigit n = (text, 3, buff `L.snoc` n)
| (len > 3) && (len < 6) = (L.append text
(replacedChoice buff n),
if n == '\\' then 2 else 1,
L.empty)
| otherwise = (L.append text
(L.append "\\u"
(choice buff n)),
if n == '\\' then 2 else 1,
L.empty) where
len = L.length buff
choice b n = if n == '\\' then b else L.snoc b n
replacedChoice b n = if n == '\\'
then repl b
else L.snoc (repl b) n
repl :: L.Text -> L.Text
repl "" = ""
repl s = (\v -> case v of
(Right x) -> L.singleton $ (\t -> toEnum t :: Char) $ fst x
(Left x) -> error $ "impossible" ++ (show x)) (hexadecimal s)
I am just running my tests (all is very fast, except this) to see time. Something like
it "converts url encoded string" $ do
(ununicode $ BL.pack $ concat $ replicate 1000 ("error:\\u041d\\u0435\\u0432\\u0435\\u0440\\u043d\\u044b\\u0439\\u100cc" :: String))
`shouldBe`
("error:\208\157\208\181\208\178\208\181\209\128\208\189\209\139\208\185")
It takes 0.5 sec for first implementation and from 5 to 15 for second. That's too much.
How can I make my second algorithm as fast as first (or even faster if it possible)?
Upvotes: 0
Views: 360
Reputation: 52029
Fast string processing in Haskell is somewhat of a black art.
Are you trying to do the same thing as the jstring
function in the aeson
package?
https://hackage.haskell.org/package/aeson-0.11.2.0/docs/Data-Aeson-Parser.html#v:jstring
Even if it's not exactly the same thing you might get some ideas by looking its implementation. It uses attoparsec underneath which come in both strict and lazy versions.
Upvotes: 2