dsvjksv
dsvjksv

Reputation: 29

Haskell text encoder

I am new to Haskell and would like some direction to solving my problem. I wanted to have a text encode function that list in which each word of the text is represented by its index. For e.g. :

["The more I like, the more I love.","The more I love, the more I hate."]

the output might be

   (["The", "more", "I", "like", "the", "love.", "love,", "hate."],
   [1, 2, 3, 4, 5, 2, 3, 6, 1, 2, 3, 7, 1, 2, 3, 8])

I have done the remove duplication part

removeDuplicates :: Eq a => [a] -> [a]
removeDuplicates = rdHelper []
  where rdHelper seen [] = seen
          rdHelper seen (x:xs)
            | x `elem` seen = rdHelper seen xs
            | otherwise = rdHelper (seen ++ [x]) xs

Upvotes: 2

Views: 171

Answers (2)

Redu
Redu

Reputation: 26201

I guess the Data.Map and Data.Set packages are ideal tools to solve this job efficiently. My implementation would be as follows;

import qualified Data.Map.Lazy as Map
import qualified Data.Set as Set

encode :: [String] -> ([String],[[Int]])
encode wss = let dict = Map.fromList . zip (Set.toList . Set.unions . map (Set.fromList . words) $ wss) $ [1..]
             in (map fst $ Map.toList dict, map (map (flip (Map.findWithDefault 0) dict) . words) wss)

*Main> encode ["Are you allright", "Hey there how are you", "Hello there", "Do you like coffee"]
(["Are","Do","Hello","Hey","allright","are","coffee","how","like","there","you"],[[1,11,5],[4,10,8,6,11],[3,10],[2,11,9,7]])

Upvotes: 1

Igor Drozdov
Igor Drozdov

Reputation: 15045

You can just iterate over the list of words and accumulate the unique words and its indexes. If the element is in the accumulated list, append the index to the accumulated list of indexes. If the element isn't in the list, append the new index (length of the list of words + 1).

To be honest, Haskell code is more understandable, than my description:

import Data.List (findIndex)

build :: ([String], [Int]) -> String -> ([String], [Int])
build (words, indexes) word =
  let
    maybeIndex = findIndex (== word) words
  in
    case maybeIndex of
      Just index ->
        (words, indexes ++ [index + 1])
      Nothing ->
        (words ++ [word], indexes ++ [(+1) . length $ words])

buildIndexes =
  let
    listOfWords = words "The more I like, the more I love. The more I love, the more I hate."
  in
    foldl build ([], []) listOfWords

Here I have a concatenated string as an input

"The more I like, the more I love. The more I love, the more I hate."

Feel free to tailor the code for your needs.

By the way, it might be more performant to insert the elements at the beginning of the lists and then reverse the resulting lists.

import Data.List (findIndex)

build :: ([String], [Int]) -> String -> ([String], [Int])
build (words, indexes) word =
  let
    maybeIndex = findIndex (== word) words
  in
    case maybeIndex of
      Just index ->
        (words, (index + 1) : indexes)
      Nothing ->
        (word : words, ((+1) . length $ words) : indexes)

buildIndexes =
  let
    listOfWords = words "The more I like, the more I love. The more I love, the more I hate."
    (listOfUniqueWords, listOfIndexes) = foldl build ([], []) listOfWords
  in
    (reverse listOfUniqueWords, reverse listOfIndexes)

Upvotes: 1

Related Questions