Optimising manipulation of large vectors

Question

This is a follow up to my previous question about processing a Vector representation of a 5.1m edge directed graph. I am trying to implement Kosaraju's graph algorithm and thus need to rearrange my Vector in the order of the finishing times of a depth first search (DFS) on the edges reversed. I have code that runs on small data sets but that fails to return in 10 minutes on the full data set. (I can't exclude that a loop arises from the big graph, but there are no signs of that on my test data.)

DFS needs to avoid revisiting nodes, so I need some sort of 'state' for the search (currently a tuple, should I use a State Monad?). The first search should return a reordered Vector, but I am keeping things simple at present by returning a list of the reordered Node indexes so that I can process the Vector in one go subsequently.

I presume the issue lies in dfsInner. The code below 'remembers' the nodes visited updating the explored field of each node (third guard). Although I tried to make it tail recursive, the code seems to grow memory use fairly fast. Do I need to enforce some strictness and if so, how? (I have another version that I use on a single search search, which checks for previous visits by looking at the start nodes of the unexplored edges on the stack and the list of nodes that have been completed. This does not grow so quickly, but does not return for any well connected node.)

However, it could also be the foldr', but how can I detect that?

This is supposedly Coursera homework, but I'm no longer sure I can tick the honour code button! Learning is more important though, so I don't really want a copy/paste answer. What I have is not very elegant - it has an imperative feel to it too, which is driven by the issue with keeping some sort of state - see third guard. I'd welcome comments on design patterns.

type NodeName = Int
type Edges    = [NodeName]
type Explored = Bool
type Stack    = [(Int, Int)]

data Node  = Node NodeName Explored Edges Edges deriving (Eq, Show)
type Graph = Vector Node

main = do
    edges <- V.fromList `fmap` getEdges "SCC.txt"
    let 
        maxIndex = fst $ V.last edges
        gr = createGraph maxIndex edges
        res = dfsOuter gr
    --return gr
    putStrLn $ show res

dfsOuter gr = 
    let tmp = V.foldr' callInner (gr,[]) gr
    in snd tmp

callInner :: Node -> (Graph, Stack) -> (Graph, Stack)
callInner (Node idx _ fwd bwd) (gr,acc) = 
    let (Node _ explored _ _) = gr V.! idx
    in case explored of
        True  -> (gr, acc)
        False ->
            let
                initialStack = map (\l -> (idx, l)) bwd
                gr' = gr V.// [(idx, Node idx True fwd bwd)]
                (gr'', newScc) = dfsInner idx initialStack (length acc) (gr', [])
            in (gr'', newScc++acc)

dfsInner :: NodeName -> Stack -> Int -> (Graph, [(Int, Int)]) -> (Graph, [(Int, Int)])
dfsInner start [] finishCounter (gr, acc) = (gr, (start, finishCounter):acc)
dfsInner start stack finishCounter (gr, acc)
    | nextStart /= start =                      -- no more places to go from this node
        dfsInner nextStart stack (finishCounter + 1) $ (gr, (start, finishCounter):acc)
    | nextExplored = 
-- nextExplored || any (\(y,_) -> y == stack0Head) stack || any (\(x,_) -> x == stack0Head) acc =
        dfsInner start (tail stack) finishCounter (gr, acc)
    | otherwise =
        dfsInner nextEnd (add2Stack++stack) finishCounter (gr V.// [(nextEnd, Node idx True nextLHS nextRHS)], acc)
--      dfsInner gr stack0Head (add2Stack++stack) finishCounter acc

    where
        (nextStart, nextEnd) = head stack
        (Node idx nextExplored nextLHS nextRHS) = gr V.! nextEnd
        add2Stack = map (\l -> (nextEnd, l)) nextRHS

Simon H · Accepted Answer

Based on @andras gist, I rewrote my code as below. I did not use Arrow functions as I am unfamiliar with them, and my second depth first search is stylistically the same as the first one (instead of @Andras filterM approach). The end result is that it completes in 20% of the time of Andras' code (21s instead of 114s).

import qualified Data.Vector as V
import qualified Data.IntSet as IS
import qualified Data.ByteString.Char8 as BS
import Data.List
import Control.Monad
import Control.Monad.State
--import Criterion.Main

--getEdges :: String -> IO [(Int, Int)]
getEdges file = do
    lines <- (map BS.words . BS.lines) `fmap` BS.readFile file
    let 
        pairs = (map . map) (maybe (error "can't read Int") fst . BS.readInt) lines
        pairs' = [(a, b) | [a, b] <- pairs]         -- adds 9 seconds
        maxIndex = fst $ last pairs'
        graph = createGraph maxIndex pairs'
    return graph

main = do
    graph <- getEdges "SCC.txt"
    --let 
        --maxIndex = fst $ V.last edges
    let 
        fts = bwdLoop graph
        leaders = fst $ execState (fwdLoop graph fts) ([], IS.empty)
    print $ length leaders

type Connections = [Int]
data Node = Node {fwd, bwd :: Connections} deriving (Show)
type Graph = V.Vector Node

type Visited = IS.IntSet
type FinishTime = Int
type FinishTimes = [FinishTime]
type Leaders = [Int]

createGraph :: Int -> [(Int, Int)] -> Graph
createGraph maxIndex pairs = 
    let
        graph  = V.replicate (maxIndex+1) (Node [] [])
        graph' = V.accum (\(Node f b) x -> Node (x:f) b) graph  pairs
    in           V.accum (\(Node f b) x -> Node f (x:b)) graph' $ map (\(a,b) -> (b,a)) pairs

bwdLoop :: Graph -> FinishTimes
bwdLoop g = fst $ execState (mapM_ go $ reverse [0 .. V.length g - 1]) ([], IS.empty) where
    go :: Int -> State (FinishTimes, Visited) ()
    go i = do
        (fTimes, vs) <- get
        let visited = IS.member i vs
        if not visited then do
            put (fTimes, IS.insert i vs)
            mapM_ go $ bwd $ g V.! i
            -- get state again after changes from mapM_
            (fTimes', vs') <- get
            put (i : fTimes', vs')
        else return ()

fwdLoop :: Graph -> FinishTimes -> State (Leaders, Visited) ()
fwdLoop _ [] = return ()
fwdLoop g (i:fts) = do
    (ls, vs) <- get
    let visited = IS.member i vs
    if not visited then do
        put (i:ls, IS.insert i vs)
        mapM_ go $ fwd $ g V.! i
    else return ()
    fwdLoop g fts

    where
        go :: Int -> State (Leaders, Visited) ()
        go i = do
            (ls, vs) <- get
            let visited = IS.member i vs
            if not visited then do
                put (ls, IS.insert i vs)
                mapM_ go $ fwd $ g V.! i
            else return ()

Optimising manipulation of large vectors

Answers (2)

Related Questions