oknsnl
oknsnl

Reputation: 351

My attempt at Project Euler #92 is too slow

I'm trying to solve Project Euler problem #92 with Haskell. I started learning Haskell recently. It's the first Project Euler problem I've tried to solve with Haskell, but my piece of code doesn't terminate even in 10 minutes. I know you don't give me the answer directly, but again I should warn I find answer with c++ doesn't give answer of Euler or new logic to solve Euler. I'm just curious why that guy doesn't work fast and what should I do to make it faster?

{--EULER 92--}
import Data.List


myFirstFunction 1 = 0
myFirstFunction 89 = 1
myFirstFunction x= myFirstFunction (giveResult x)

giveResult 0 = 0
giveResult x = (square (mod x 10)) + (giveResult (div x 10))

square x = x*x


a=[1..10000000]


main = putStrLn(show (sum (map myFirstFunction a))) 

Upvotes: 3

Views: 834

Answers (2)

Daniel Fischer
Daniel Fischer

Reputation: 183873

The biggest speedup can of course be gained from using a better algorithm. I'm not going deep into that here, though.

Original algorithm tweakings

So let's focus on improving the used algorithm without really changing it.

  1. You never give any type signature, therefore the type defaults to the arbitrary precision Integer. Everything here fits easily in an Int, there's no danger of overflow, so let's use that. Adding a type signature myFirstFunction :: Int -> Int helps: time drops from Total time 13.77s ( 13.79s elapsed) to Total time 6.24s ( 6.24s elapsed) and total allocation drops by a factor of about 15. Not bad for such a simple change.

  2. You use div and mod. These always compute a non-negative remainder and the corresponding quotient, so they need some extra checks in case some negative numbers are involved. The functions quot and rem map to the machine division instructions, they don't involve such checks and therefore are somewhat faster. If you compile via the LLVM backend (-fllvm), that also takes advantage of the fact that you always divide by a single known number (10), and converts the division into multiplication and bit-shift. Time now: Total time 1.56s ( 1.56s elapsed).

  3. Instead of using quot and rem separately, let's use the quotRem function that computes both at once, so that we don't repeat the division (even with multiplication+shift that takes a little time):

    giveResult x = case x `quotRem` 10 of
                     (q,r) -> r*r + giveResult q
    

    That doesn't gain much, but a little: Total time 1.49s ( 1.49s elapsed).

  4. You're using a list a = [1 .. 10000000], and map the function over that list and then sum the resulting list. That's idiomatic, neat and short, but not super fast, since allocating all those list cells and garbage collecting them takes time too - not very much, since GHC is very good at that, but transforming it into a loop

    main = print $ go 0 1
        where
            go acc n
                | n > 10000000 = acc
                | otherwise    = go (acc + myFirstFunction n) (n+1)
    

    gains us a little still: Total time 1.34s ( 1.34s elapsed) and the allocation dropped from 880,051,856 bytes allocated in the heap for the last list version to 51,840 bytes allocated in the heap.

  5. giveResult is recursive, and therefore cannot be inlined. The same holds for myFirstFunction, hence each computation needs two function calls (at least). We can avoid that by rewriting giveResult to a non-recursive wrapper and a recursive local loop,

    giveResult x = go 0 x
        where
            go acc 0 = acc
            go acc n = case n `quotRem` 10 of
                         (q,r) -> go (acc + r*r) q
    

    so that that can be inlined: Total time 1.04s ( 1.04s elapsed).

Those were the most obvious points, further improvements - apart from the memoisation mentioned by hammar in the comments - would need some thinking.

We are now at

module Main (main) where

myFirstFunction :: Int -> Int
myFirstFunction 1 = 0
myFirstFunction 89 = 1
myFirstFunction x= myFirstFunction (giveResult x)

giveResult :: Int -> Int
giveResult x = go 0 x
    where
        go acc 0 = acc
        go acc n = case n `quotRem` 10 of
                     (q,r) -> go (acc + r*r) q

main :: IO ()
main = print $ go 0 1
    where
        go acc n
            | n > 10000000 = acc
            | otherwise    = go (acc + myFirstFunction n) (n+1)

With -O2 -fllvm, that runs in 1.04 seconds here, but with the native code generator (only -O2), it takes 3.5 seconds. That difference is due to the fact that GHC itself doesn't convert the division into a multiplication and bit-shift. If we do it by hand, we get pretty much the same performance from the native code generator.

Because we know something that the compiler doesn't, namely that we never deal with negative numbers here, and the numbers don't become large, we can even generate a better multiply-and-shift (that would produce wrong results for negative or large dividends) than the compiler and take the time down to 0.9 seconds for the native code generator and 0.73 seconds for the LLVM backend:

import Data.Bits

qr10 :: Int -> (Int, Int)
qr10 n = (q, r)
  where
    q = (n * 0x66666667) `unsafeShiftR` 34
    r = n - 10 * q

Note: That requires that Int is a 64-bit type, it won't work with 32-bit Ints, it will produce wrong results for negative n, and the multiplication will overflow for large n. We're entering dirty-hack territory. We can alleviate the dirtyness by using Word instead of Int, that leaves only the overflow (which doesn't occur for n <= 10737418236 with Word resp n <= 5368709118 for Int, so here we are comfortably in the safe zone). The times aren't affected.

The corresponding C programme

#include <stdio.h>

unsigned int myFirstFunction(unsigned int i);
unsigned int giveResult(unsigned int i);

int main(void) {
    unsigned int sum = 0;
    for(unsigned int i = 1; i <= 10000000; ++i) {
        sum += myFirstFunction(i);
    }
    printf("%u\n",sum);
    return 0;
}

unsigned int myFirstFunction(unsigned int i) {
    if (i == 1) return 0;
    if (i == 89) return 1;
    return myFirstFunction(giveResult(i));
}

unsigned int giveResult(unsigned int i) {
    unsigned int acc = 0, r, q;
    while(i) {
        q = (i*0x66666667UL) >> 34;
        r = i - q*10;
        i = q;
        acc += r*r;
    }
    return acc;
}

performs similarly, compiled with gcc -O3, it runs in 0.78 seconds, and with clang -O3 in 0.71.

That's pretty much the end without changing the algorithm.


Memoisation

Now, a minor change of algorithm is memoisation. If we build a lookup table for the numbers <= 7*9², we need only one computation of the sum of the squares of the digits for each number rather than iterating that until we reach 1 or 89, so let's memoise,

module Main (main) where

import Data.Array.Unboxed
import Data.Array.IArray
import Data.Array.Base (unsafeAt)
import Data.Bits

qr10 :: Int -> (Int, Int)
qr10 n = (q, r)
  where
    q = (n * 0x66666667) `unsafeShiftR` 34
    r = n - 10 * q

digitSquareSum :: Int -> Int
digitSquareSum = go 0
  where
    go acc 0 = acc
    go acc n = case qr10 n of
                 (q,r) -> go (acc + r*r) q

table :: UArray Int Int
table = array (0,567) $ assocs helper
  where
    helper :: Array Int Int
    helper = array (0,567) [(i, f i) | i <- [0 .. 567]]
    f 0 = 0
    f 1 = 0
    f 89 = 1
    f n = helper ! digitSquareSum n

endPoint :: Int -> Int
endPoint n = table `unsafeAt` digitSquareSum n

main :: IO ()
main = print $ go 0 1
  where
    go acc n
        | n > 10000000 = acc
        | otherwise    = go (acc + endPoint n) (n+1)

Doing the memoisation by hand instead of using a library makes the code longer, but we can tailor it to our needs. We can use an unboxed array, and we can omit the bounds check on the array access. Both significantly speed the computation up. The time is now 0.18 seconds for the native code generator, and 0.13 seconds for the LLVM backend. The corresponding C programme runs in 0.16 seconds compiled with gcc -O3, and 0.145 seconds compiled with clang -O3 (Haskell beats C, w00t!).


Scaling and a hint for a better algorithm

The used algorithm however doesn't scale too well, a bit worse than linear, and for an upper bound of 108 (with suitably adapted memoisation limit), it runs in 1.5 seconds (ghc -O2 -fllvm), resp. 1.64 seconds (clang -O3) and 1.87 seconds (gcc -O3) [2.02 seconds for the native code generator].

Using a different algorithm that counts the numbers whose sequence ends in 1 by partitioning such numbers into a sum of squares of digits (The only numbers that directly produce 1 are powers of 10. We can write

10 = 1×3² + 1×1²
10 = 2×2² + 2×1²
10 = 1×2² + 6×1²
10 = 10×1²

From the first, we obtain 13, 31, 103, 130, 301, 310, 1003, 1030, 1300, 3001, 3010, 3100, ... From the second, we obtain 1122, 1212, 1221, 2112, 2121, 2211, 11022, 11202, ... From the third 1111112, 1111121, ...

Only 13, 31, 103, 130, 301, 310 are possible sums of squares of the digits of numbers <= 10^10, so only those need to be investigated further. We can write

100 = 1×9² + 1×4² + 3×1²
...
100 = 1×8² + 1×6²
...

The first of these partitions generates no children since it requires five nonzero digits, the other explicitly given generates the two children 68 and 86 (also 608 if the limit is 108, more for larger limits)), we can get better scaling and a faster algorithm.

The fairly unoptimised programme I wrote way back when to solve this problem runs (input is exponent of 10 of the limit)

$ time ./problem92 7
8581146

real    0m0.010s
user    0m0.008s
sys     0m0.002s
$ time ./problem92 8
85744333

real    0m0.022s
user    0m0.018s
sys     0m0.003s
$ time ./problem92 9
854325192

real    0m0.040s
user    0m0.033s
sys     0m0.006s
$ time ./problem92 10
8507390852

real    0m0.074s
user    0m0.069s
sys     0m0.004s

in a different league.

Upvotes: 23

hammar
hammar

Reputation: 139830

First off, I took the liberty of cleaning up your code a little:

endsAt89 1  = 0
endsAt89 89 = 1
endsAt89 n  = endsAt89 (sumOfSquareDigits n)

sumOfSquareDigits 0 = 0
sumOfSquareDigits n = (n `mod` 10)^2 + sumOfSquareDigits (n `div` 10)    

main = print . sum $ map endsAt89 [1..10^7]

On my crappy netbook is 1 min 13 sec. Let's see if we can improve that.

Since the numbers are small, we can start by using machine-sized Int instead of arbitrary-size Integer. This is just a matter of adding type signatures, e.g.

sumOfSquareDigits :: Int -> Int

That improves the run time drastically to 20 seconds.

Since the numbers are all positive, we can replace div and mod with the slightly faster quot and rem, or even both in one go with quotRem:

sumOfSquareDigits :: Int -> Int
sumOfSquareDigits 0 = 0
sumOfSquareDigits n = r^2 + sumOfSquareDigits q
  where (q, r) = quotRem x 10

Run time is now 17 seconds. Making it tail recursive shaves off another second:

sumOfSquareDigits :: Int -> Int
sumOfSquareDigits n = loop n 0
  where
    loop 0 !s = s
    loop n !s = loop q (s + r^2)
      where (q, r) = quotRem n 10

For further improvements, we can notice that sumOfSquareDigits returns at most 567 = 7 * 9^2 for the given input numbers, so we can memoize for small numbers to reduce the number of iterations needed. Here's my final version (using the data-memocombinators package for the memoization):

{-# LANGUAGE BangPatterns #-}
import qualified Data.MemoCombinators as Memo

endsAt89 :: Int -> Int
endsAt89 = Memo.arrayRange (1, 7*9^2) endsAt89'
  where
    endsAt89' 1  = 0
    endsAt89' 89 = 1
    endsAt89' n  = endsAt89 (sumOfSquareDigits n)

sumOfSquareDigits :: Int -> Int
sumOfSquareDigits n = loop n 0
  where
    loop 0 !s = s
    loop n !s = loop q (s + r^2)
      where (q, r) = quotRem n 10

main = print . sum $ map endsAt89 [1..10^7]

This runs in just under 9 seconds on my machine.

Upvotes: 9

Related Questions