Optimizing a Trie

Posted on January 28, 2014

Tags: haskell

The other day I stumbled across some of old code for a trie. Being the hardworking and responsible student that I am, I immediately dropped all my homework to try and squeeze some performance out of it.

The code I ended up with was something like

    {-# LANGUAGE ViewPatterns, OverloadedStrings #-}
    import Data.List
    import Data.Maybe
    import qualified Data.Map.Strict as M
    import Data.Map.Strict ((!))
    import qualified Data.ByteString.Char8 as B

    newtype Trie = Node {subs :: M.Map Char Trie}
      deriving (Eq, Show, Ord)

    add :: B.ByteString -> Trie -> Trie
    add (B.uncons -> Just (c, s)) (Node subs) = Node
                                               . with
                                               . maybe (Node M.empty) id
                                               $ M.lookup c subs
      where with next = M.insert c (add s next) subs

    add _ trie = trie

    build :: [B.ByteString] -> Trie
    build = foldl' (flip add) (Node M.empty)

    contains :: B.ByteString -> Trie -> Bool
    contains (B.uncons -> Just (c, s)) (Node subs) = M.member c subs && contains s next
      where next = subs ! c
    contains _ _ = True

and so I decided to test it with a simple main

    main = do
      trie <- (build . B.lines) `fmap` B.readFile "/usr/share/dict/words"
      print $ contains "zebra" trie

and compiled it with the standard optimization flags

> ghc -O2 -fllvm trie.hs
> time ./trie
True

real 0m0.964s
user 0m0.896s
sys  0m0.059s

So pretty reasonably fast for about 500k words. Then to see if I could kick this in the teeth by forcing the entire trie I ran

    main = do
      words <- B.lines `fmap` B.readFile "/usr/share/dict/words"
      print $ all (flip contains $ build words) words

And again I ran it

> ghc -O2 -fllvm trie.hs
> time ./trie
True

real 0m1.557s
user 0m1.427s
sys  0m0.119s

Now this is weird. It seems that such building the trie is such a huge bottleneck that the building it takes around .9 seconds and querying it 500k times is only .6 seconds.

Now the next logical step was to see if maybe this was due to a build up of garbage that was dominating my time. A quick run with +RTS -s gives tells me that about 36% of my time is GC on both. This isn’t shocking since I’m building up a ton of data over a relatively short period of time.

So it’s not GC, and both operations are obviously O(n) (remembering that each map has 256 or less entries). So the only difference between them is that add allocates memory, a new map, and contains doesn’t. So let’s confirm our suspicions with a quick round of profiling.

 > ghc -O2 -fllvm -prof -auto-all trie.hs
 > ./trie

generates trie.prof with the profiling information for our runs.

They both inform us that about 97% of time and memory is spent in add. Now this seems odd, since our time difference between the runs was significant. I hypothesize that the reason for this is that when we make 500k queries we’re forcing the entire trie which is lazy. Indeed looking out the output from +RTS -s confirms that the 500k output allocates 1.69 mb of memory while only one query allocates 80 mb of memory.

So the bottleneck here is that allocation is slow as a dog. More to follow on optimizing this.